Hanoi Forecasting Results vs Literature
This page turns a draft comparison memo into a reader-facing summary. It positions the Hanoi forecasting experiments relative to the reviewed PM$_{2.5}$ forecasting literature without treating heterogeneous studies as strictly comparable.
Bottom Line
The Hanoi experiments fit the broad literature pattern in three ways:
XGBoostis a credible single-station baseline.- persistence remains difficult to beat at very short horizons.
- longer-horizon improvement comes more from meteorology and mixing physics than from extra temporal heuristics.
They also contribute two useful negative or clarifying results:
- sparse
IGRAinversion features did not help forecast accuracy - severe-event detection remained weak even when continuous forecast skill was acceptable
Where the Hanoi Results Sit
Short Horizon
The T+1h Hanoi model reached $R^2$ 0.825 and NRMSE near the middle of the literature range for practical single-station models. That is directionally consistent with published XGBoost and LSTM work, though direct comparison is limited by different regions, emissions, and validation protocols.
Medium Horizon
At T+24h, Hanoi’s best model used MERRA-2 atmospheric features and achieved RMSE 21.76 $\mu g/m^3$. In absolute terms that sits inside the broad range reported by many urban PM$_{2.5}$ studies, but absolute RMSE alone is not enough to claim parity or superiority across cities.
The more defensible comparison is qualitative:
- the error increased gradually with horizon instead of collapsing abruptly
- the model still beat persistence
- atmospheric mixing variables mattered most once the lead time moved past the immediate recent-state regime
Comparison by Theme
1. Model Choice
The literature review supports XGBoost as one of the strongest tabular single-station baselines. The Hanoi study confirms that choice was sensible:
- good performance without an oversized model stack
- fast training
- easy feature-block ablation
- interpretable feature rankings
This does not prove XGBoost is universally best. It does show that for a one-station operational-style workflow, it is a strong default.
2. Forecast Horizon Design
Many PM$_{2.5}$ papers publish only one horizon, usually around T+24h. The Hanoi study is stronger methodologically because it evaluates:
T+1hT+6hT+12hT+24hT+48hT+72h
That makes the horizon-specific tradeoffs clearer than a one-number benchmark paper.
3. Atmospheric Feature Use
The reviewed literature suggests boundary-layer physics is relevant but often underused in station-level ML studies. The Hanoi results support that interpretation:
MERRA-2added small but repeatable value- the gain was modest rather than transformational
- the most valuable signals were tied to mixing depth and stability
This is a more careful claim than saying atmospheric features “solve” the forecasting problem.
4. Inversion Features
Upper-air inversion logic is physically attractive, but the Hanoi study shows that sparsity can erase practical value. In this setup, twice-daily radiosondes were too coarse for consistent forecast improvement.
That is useful because it sets a realistic expectation: not every physically meaningful measurement becomes an operationally useful feature.
5. Severe Event Detection
This is where the Hanoi experiments align with the literature’s biggest unsolved problem.
The models forecasted continuous PM$_{2.5}$ reasonably well, but they still struggled to identify rare severe events at high precision. The best result was at T+6h, where recall at 90% precision remained low.
That is consistent with a broader pattern:
- regression models optimize common values
- severe events are rare
- public-warning use cases need more than ordinary RMSE gains
What Is New or Most Useful Here
Within the limits of a single-station study, the most useful contributions are:
- A transparent multi-horizon experiment instead of a single
T+24hresult. - A clean ablation showing that
MERRA-2helps a little andIGRAdoes not. - A practical warning that temporal heuristics can overfit.
- A realistic severe-event assessment instead of relying on average error alone.
What Should Not Be Overclaimed
The comparison remains bounded by several constraints:
- literature studies use different cities and pollution regimes
- train/test designs vary widely
- some papers use observed future weather rather than archived forecast inputs
- the Hanoi study is single-station and site-specific
So the right conclusion is not that Hanoi “beats the literature.” The right conclusion is that the Hanoi results are broadly credible within the literature’s mid-range single-station pattern, while adding a few well-documented feature ablations and negative results.
Practical Interpretation
If someone wanted to build a production-oriented version of this system, the literature comparison points toward a clear path:
- keep
XGBoostor a comparable boosted-tree model as the tabular baseline - preserve explicit persistence benchmarking
- replace
MERRA-2reanalysis with live forecast equivalents - treat severe-event warning as a separate modeling problem, not an automatic byproduct of low RMSE
Related Pages
Source Note
This page is adapted from the local draft tmp/air-quality-analysis-upstream/doc-work/hanoi-pm25-experiments-literature-comparison.md.