TopoFlow

Abstract

Paper Summary

We propose TopoFlow, a physics-informed Vision Transformer for multi-pollutant air quality forecasting over China. TopoFlow extends the ClimaX architecture with two key innovations: (1) wind-following patch reordering that aligns the sequential processing of spatial patches with atmospheric transport direction, and (2) topography-aware attention bias that encodes elevation barriers as a learnable inductive bias in the first transformer block.

The model operates on a 128 × 256 grid at 0.25° resolution covering mainland China and Taiwan, jointly forecasting six pollutants (PM₂.₅, PM₁₀, SO₂, NO₂, CO, O₃) at four temporal horizons (12h, 24h, 48h, 96h). Trained on the Chinese Air Quality Reanalysis (CAQRA) dataset from 2013–2016, validated on 2017, and tested on 2018, TopoFlow is compared against CAMS global forecasts, Microsoft Aurora, the ClimaX baseline, and independent OpenAQ station measurements.

The learnable parameter α controls the strength of topographic resistance in the elevation bias, physically encoding mountain blocking effects on pollutant transport — a phenomenon critical in regions like the Sichuan Basin, Tarim Basin, and North China Plain.

Read the Full Paper on arXiv

Methodology

Key Innovations

Two physics-informed inductive biases for atmospheric transport modeling, integrated into a pre-trained ClimaX backbone.

Wind-Following Patch Reordering

Standard ViTs process patches in raster-scan order, which is agnostic to physical transport. TopoFlow reorders patches from upwind to downwind using 16 pre-computed wind sectors, creating an inductive bias aligned with atmospheric advection.

$$\pi_k^{(s)} = \cos(\theta_s) \cdot \frac{c_k}{W-1} - \sin(\theta_s) \cdot \frac{r_k}{H-1}$$

Patches are sorted by ascending projection $\pi_k^{(s)}$ along the dominant wind direction $\theta_s$. The wind direction is computed from magnitude-weighted averaging of $(u, v)$ wind components.
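The ordering can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's code: the function name `wind_order`, the tie-breaking by raster index, and the choice of sector angles at multiples of 22.5° are assumptions, as is reading $r_k$ and $c_k$ as the row and column of patch $k$ in raster order.

```python
import math

def wind_order(grid_h, grid_w, theta):
    """Return patch indices sorted by ascending projection for wind angle theta (radians)."""
    scored = []
    for k in range(grid_h * grid_w):
        r, c = divmod(k, grid_w)  # row and column of patch k in raster order
        proj = (math.cos(theta) * c / (grid_w - 1)
                - math.sin(theta) * r / (grid_h - 1))
        scored.append((proj, k))
    scored.sort()  # ascending projection; ties fall back to raster index (assumption)
    return [k for _, k in scored]

# One ordering per sector, computed once up front (tiny 4 x 8 patch grid for illustration)
SECTOR_ORDERS = [wind_order(4, 8, s * 2 * math.pi / 16) for s in range(16)]
```

With `theta = 0` the projection reduces to the normalized column index, so a 2 × 3 grid is visited column by column: `wind_order(2, 3, 0.0)` gives `[0, 3, 1, 4, 2, 5]`.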

Alpha Elevation Bias

The first transformer block (Block 0) incorporates a learnable elevation bias that penalizes attention between patches when the destination patch is at higher elevation, encoding topographic barriers.

$$\mathbf{B}_{\text{elev}}[i,j] = -\alpha \cdot \text{ReLU}\!\left(\frac{z_j - z_i}{z_0}\right)$$

The learnable parameter $\alpha$ (initialized at 2.0) controls topographic resistance strength. $z_0 = 1000$ m is a normalization constant. ReLU ensures only uphill transport is penalized.
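As a small worked example, the bias matrix can be computed directly from a list of patch elevations. The function name `elevation_bias` and the three elevations are made up for illustration; $\alpha = 2.0$ and $z_0 = 1000$ m are the values quoted above.

```python
def elevation_bias(z, alpha=2.0, z0=1000.0):
    """B[i][j] = -alpha * ReLU((z[j] - z[i]) / z0) for patch elevations z in metres."""
    return [[-alpha * max(0.0, (zj - zi) / z0) for zj in z] for zi in z]

# Hypothetical patches: basin floor (500 m), foothill (1500 m), plateau (4500 m)
B = elevation_bias([500.0, 1500.0, 4500.0])
```

The basin-floor row receives penalties of −2 and −8 toward the two higher patches, while the plateau row is unpenalized because every other patch lies below it.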

Relative 2D Position Bias

Following T5-style bucketing, relative spatial positions are discretized into 32 × 32 buckets with logarithmic binning. This reduces the bias from $\mathcal{O}(N^2)$ entries (one per pair of the $N$ patches) to $\mathcal{O}(M^2)$ learnable parameters, where $M = 32$ is the number of buckets per axis, while still encoding spatial proximity.
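A rough sketch of such bucketing is shown below. It follows the general shape of T5's bidirectional scheme (half the buckets per sign, exact bins for small offsets, logarithmic bins beyond) but is not taken from the TopoFlow code: the helper names, the `max_distance=128` cutoff, and the exact/log split are all assumptions.

```python
import math

def bucket_1d(offset, num_buckets=32, max_distance=128):
    """Map a signed patch offset along one axis to one of num_buckets bins."""
    half = num_buckets // 2            # half of the buckets for each sign
    base = half if offset > 0 else 0
    d = abs(offset)
    max_exact = half // 2              # small distances get one bucket each
    if d < max_exact:
        return base + d
    # larger distances share logarithmically spaced buckets, clamped at the last one
    log_pos = max_exact + int(
        math.log(d / max_exact) / math.log(max_distance / max_exact) * (half - max_exact)
    )
    return base + min(log_pos, half - 1)

def bucket_2d(d_row, d_col, num_buckets=32):
    """Flat index into a num_buckets x num_buckets bias table for one attention head."""
    return bucket_1d(d_row, num_buckets) * num_buckets + bucket_1d(d_col, num_buckets)
```

Every pair of patches then looks up one of 32 × 32 = 1,024 learnable values per head, instead of owning a dedicated entry.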

$$\mathbf{A} = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_h}} + \mathbf{B}_{\text{pos}} + \mathbf{B}_{\text{elev}}\right)$$

The final attention scores combine raw dot-product similarity, position bias, and elevation bias before softmax.
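To make the effect of the additive biases concrete, here is a minimal single-query sketch in plain Python. The function name and the toy inputs are illustrative assumptions; a real implementation would operate on batched tensors.

```python
import math

def biased_attention_row(q, keys, b_pos, b_elev):
    """Attention weights of one query over all keys, with additive biases on the logits."""
    d_h = len(q)
    logits = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_h) + bp + be
        for k, bp, be in zip(keys, b_pos, b_elev)
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two identical keys, so only the biases can tell them apart
q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
weights = biased_attention_row(q, keys, b_pos=[0.0, 0.0], b_elev=[0.0, -2.0])
```

Because the biases are added before the softmax, an elevation penalty of −2 scales the second key's weight by a factor of $e^{-2}$ relative to the first.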

Multi-Pollutant Forecasting

Each of 15 input variables is embedded independently via variable-specific Conv2D projections, then aggregated through cross-attention. A 2-layer MLP decoder predicts 6 pollutants × 4 horizons simultaneously from a single forward pass.

$$\hat{\mathbf{Y}} \in \mathbb{R}^{B \times 6 \times 128 \times 256} \quad \text{(per forecast horizon)}$$

Training uses MSE loss with geographic masking restricted to mainland China and Taiwan (~45% of the grid).

52M parameters · 6 pollutants · 4 forecast horizons · 128 GPUs (MI250X) · 0.25° resolution

Evaluation

Results & Case Studies

TopoFlow validated across extreme pollution events, seasonal patterns, and multi-model comparisons on the 2018 test set and independent OpenAQ station data.

Animated Predictions

TopoFlow Predictions vs CAQRA Ground Truth

Side-by-side animation of TopoFlow 12h forecasts compared to CAQRA reanalysis across the 2018 test set.

Overview

PM₂.₅ Distribution over China

Annual mean PM₂.₅ concentrations across the study domain, highlighting the North China Plain, Sichuan Basin, and Yangtze River Delta as major pollution hotspots. The 128 × 256 grid at 0.25° resolution covers mainland China and Taiwan.

Multi-Model Comparison

Seasonal Validation: CAMS vs Aurora vs CAQRA vs TopoFlow vs OpenAQ

PM₂.₅ spatial distribution across Winter (Jan 15), Spring (Mar 1), Summer (Jul 12), and Autumn (Oct 19) 2019. Five rows compare CAMS global forecasts, Microsoft Aurora, CAQRA reanalysis (ground truth), TopoFlow predictions, and independent OpenAQ station measurements. TopoFlow captures fine-scale spatial structure absent from CAMS global forecasts and better resolves seasonal patterns than Aurora.

Case Study

Beijing Haze Episode (Nov 2018)

Back-trajectory analysis of a major trans-boundary pollution event. PM₂.₅ snapshots from Nov 8–16 show the plume building from Central China and converging on Beijing, reaching >200 μg/m³. TopoFlow captures the temporal evolution and spatial extent of the event.

Topographic Blocking

Sichuan Basin Case Study

Topographic blocking effect in the Sichuan Basin (Jul 2018). The Tibetan Plateau acts as a barrier, trapping pollutants in the basin. TopoFlow's elevation bias captures this mechanism: the transect at 30°N shows PM₂.₅ accumulation matching observed patterns, unlike ClimaX, which smooths across the barrier.

3D Visualization

PM₂.₅ Prediction Surface with OpenAQ Station Validation

Three-dimensional visualization of TopoFlow's predicted PM₂.₅ surface over the Sichuan Basin region, with independent OpenAQ station measurements overlaid as red markers. The surface captures the spatial gradient from elevated concentrations in the basin interior (trapped by topography) to lower values at the plateau edges.

Mathematical Framework

Core Equations

The mathematical formulation of TopoFlow's physics-informed attention mechanism.

TopoFlow Attention (Block 0)

The first transformer block augments standard attention with position and elevation biases:

$$\mathbf{A} = \text{softmax}\!\left(\frac{\mathbf{QK}^\top}{\sqrt{d_h}} + \mathbf{B}_{\text{pos}} + \mathbf{B}_{\text{elev}}\right)$$

Elevation Barrier Bias

Penalizes uphill pollutant transport across topographic barriers:

$$\mathbf{B}_{\text{elev}}[i,j] = -\alpha \cdot \text{ReLU}\!\left(\frac{z_j - z_i}{1000}\right)$$

Wind Direction Estimation

Magnitude-weighted average of the horizontal wind components, where each grid cell's weight $w = \sqrt{u^2 + v^2}$ is its wind speed:

$$\theta_{\text{wind}} = \operatorname{arctan2}\!\left(\frac{\sum u \cdot w}{\sum w},\; \frac{\sum v \cdot w}{\sum w}\right)$$
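A plain-Python sketch of this estimate follows, together with one possible way to snap the angle to one of the 16 sectors. The argument order of `atan2` mirrors the formula as written; the function names, the calm-wind fallback, and the nearest-sector rounding are assumptions rather than documented behavior.

```python
import math

def dominant_wind_angle(u, v):
    """Speed-weighted mean wind direction over a list of grid cells."""
    w = [math.hypot(ui, vi) for ui, vi in zip(u, v)]  # weight = wind speed per cell
    total = sum(w)
    if total == 0.0:
        return 0.0  # calm field: fall back to angle 0 (assumption)
    u_bar = sum(ui * wi for ui, wi in zip(u, w)) / total
    v_bar = sum(vi * wi for vi, wi in zip(v, w)) / total
    return math.atan2(u_bar, v_bar)

def sector_index(theta, n_sectors=16):
    """Nearest of n_sectors equally spaced directions (22.5 degrees each for 16)."""
    width = 2 * math.pi / n_sectors
    return round((theta % (2 * math.pi)) / width) % n_sectors
```

Weighting by speed means a few cells with strong wind outweigh many cells with light, variable wind, which keeps the chosen ordering tied to the flow that actually moves pollutants.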

Geographic Loss Function

MSE computed only over valid China + Taiwan regions:

$$\mathcal{L} = \frac{1}{\|\mathbf{M}\|_1} \sum_{v,i,j} \mathbf{M}_{ij} \cdot (\hat{Y}_{v,i,j} - Y_{v,i,j})^2$$
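The masked loss can be written out literally in plain Python. This sketch follows the formula as stated, so the sum runs over variables $v$ while the normalization uses only the number of valid cells $\|\mathbf{M}\|_1$; the function name and the nested-list layout are illustrative assumptions.

```python
def masked_mse(pred, target, mask):
    """Squared error summed over masked cells, divided by the number of valid cells.

    pred, target: nested lists indexed [v][i][j]; mask: [i][j] with 1 for valid cells.
    """
    valid = sum(sum(row) for row in mask)  # ||M||_1
    if valid == 0:
        raise ValueError("mask selects no grid cells")
    total = 0.0
    for p_v, t_v in zip(pred, target):
        for p_row, t_row, m_row in zip(p_v, t_v, mask):
            for p, t, m in zip(p_row, t_row, m_row):
                total += m * (p - t) ** 2
    return total / valid

# One variable on a 2 x 2 grid; only the left column is inside the mask
loss = masked_mse(
    pred=[[[1.0, 5.0], [2.0, 0.0]]],
    target=[[[0.0, 0.0], [0.0, 0.0]]],
    mask=[[1, 0], [1, 0]],
)
```

Here the large error of 5.0 in the masked-out cell is ignored, leaving (1² + 2²) / 2 = 2.5.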

Technical Details

Model Specifications

Complete technical specifications of the TopoFlow architecture and training configuration.

Component	Specification
Input Resolution	128 × 256 (0.25° grid)
Patch Size	2 × 2 pixels (8,192 patches)
Embedding Dimension	768
Transformer Depth	6 blocks (1 TopoFlow + 5 standard ViT)
Attention Heads	8 (per-head dim = 96)
MLP Ratio	4.0 (hidden dim = 3,072)
Total Parameters	~52M
Input Variables	15 (5 meteo + 6 pollutants + 2 coords + 2 static)
Output Variables	6 pollutants (PM₂.₅, PM₁₀, SO₂, NO₂, CO, O₃)
Forecast Horizons	12h, 24h, 48h, 96h
Wind Sectors	16 pre-computed orderings (22.5° each)
Position Bias Buckets	32 × 32 = 1,024 per head
α Initialization	2.0 (learnable, converges to ~1.0)
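
The derived figures in this table follow from the base settings by simple arithmetic, which the short check below reproduces. The variable names are illustrative; all numbers are taken from the table.

```python
img_h, img_w, patch = 128, 256, 2
embed_dim, num_heads, mlp_ratio = 768, 8, 4.0

num_patches = (img_h // patch) * (img_w // patch)  # 64 x 128 patch grid
head_dim = embed_dim // num_heads
mlp_hidden = int(embed_dim * mlp_ratio)
sector_width_deg = 360 / 16                        # 16 wind sectors
bias_entries_per_head = 32 * 32                    # bucketed position-bias table
dense_bias_entries = num_patches ** 2              # one entry per patch pair, for comparison

print(num_patches, head_dim, mlp_hidden, sector_width_deg,
      bias_entries_per_head, dense_bias_entries)
```

The last two values show why bucketing matters at this patch size: a dense pairwise bias would need 8,192² ≈ 67 million entries per head, against 1,024 with buckets.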

Training Parameter	Value
Optimizer	AdamW (weight decay = 0.01)
Base Learning Rate	1 × 10⁻⁴ (cosine annealing + 2k warmup)
Effective Batch Size	512 (128 GPUs × 2 × 2 grad accum)
Training Data	2013–2016 (CAQRA)
Validation	2017
Test Set	2018
Epochs	60
Infrastructure	128 AMD MI250X GPUs (16 LUMI-G nodes)
Framework	PyTorch Lightning + DDP

Getting Started

Quick Start

Install TopoFlow and run inference in a few lines of code.

train.py

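# Install (run in a shell, not inside Python):
#   pip install git+https://github.com/AmmarKheder/TopoFlow.git

import torch
from topoflow import TopoFlowModel, TopoFlowDataModule

# Initialize the model with the 15 input variables
model = TopoFlowModel(
    variables=["u", "v", "temp", "rh", "psfc",                  # 5 meteorological
               "pm25", "pm10", "so2", "no2", "co", "o3",        # 6 pollutants
               "lat2d", "lon2d", "elevation", "population"],    # 2 coords + 2 static
    img_size=(128, 256),
    embed_dim=768,
    depth=6,
    num_heads=8,
    parallel_patch_embed=True,   # wind-following
    use_physics_mask=True,       # elevation bias
)
model.eval()

# Placeholder input; the [B, 15, 128, 256] layout is an assumption, replace with real data
x = torch.randn(1, 15, 128, 256)

# Run inference
with torch.no_grad():
    predictions = model(x, lead_times=[12, 24, 48, 96])
# Output: [B, 6, 128, 256] per horizon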

submit_train.sh

#!/bin/bash
#SBATCH --nodes=16 --gpus-per-node=8 --time=48:00:00
# Distributed training on LUMI-G (128 AMD MI250X GPUs)

srun python scripts/train.py \
    --config configs/default.yaml \
    --data.train_years 2013 2014 2015 2016 \
    --data.val_years 2017 \
    --model.embed_dim 768 \
    --model.depth 6
            
