Statistical Modeling
This page is the working home for the modeling side of the PM$_{2.5}$ forecasting section. It is meant to sit between the literature review and the live demo workflow.
Purpose
This section narrows the forecasting problem from a broad literature review into a concrete statistical modeling workflow:
- define what is being predicted
- define what information is available at forecast time
- define the baseline models that must be beaten
- define how predicted values will be compared against measured concentration
Connecting Literature to Practice: The ML Approach
As outlined in the literature review, major government agencies rely on Chemical Transport Models (CTMs) like CMAQ or WRF-Chem. While powerful, CTMs have three major operational constraints for local practitioners:
- They require immense supercomputing resources.
- They run slowly, so forecasts are usually updated only twice a day.
- They depend on live emission inventories (accurate, near-real-time spatial data on factory output, traffic volume, and agricultural fires), which are often unavailable in developing or rapidly growing regions.
This is where Statistical Machine Learning steps in. By using recent PM$_{2.5}$ momentum as an anchor and applying non-linear models (like XGBoost or MLPs) to meteorological forecasts (wind, rain, inversions), ML models act as “weather-driven dispersion proxies.” We cannot predict exactly what will be emitted, but we can mathematically predict how the weather will clear or trap whatever is already in the air.
The “Predictability Ceiling”
When relying purely on weather data and historical PM$_{2.5}$ without explicit emission data, Statistical ML hits a mathematical “Predictability Ceiling.”
Through extensive experimentation in Hanoi, we found that purely meteorological forecasting plateaus at an RMSE of roughly 20 $\mu g/m^3$ at T+24h. Weather tells us how pollutants disperse (e.g., “wind speeds are dropping, so whatever is emitted will be trapped”). But without spatial emission inventories, the model must guess the actual emission volume based entirely on past behavior. To push errors below this ceiling and produce forecasts reliable enough for public health decisions, the next generation of modeling must combine weather-based ML with spatial upstream sensors and local emission metrics.
Target
The main forecasting target is hourly PM$_{2.5}$ concentration at a fixed monitoring location in Hanoi.
Typical targets for this section:
- T+1h
- T+6h
- T+12h
- T+24h
- T+48h
- T+72h
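As an illustration of how the horizons above turn into supervised targets, here is a minimal sketch. It assumes the station data is already a gap-marked hourly list (one value per hour, `None` for a missing hour); the function name `make_targets` and the synthetic series are hypothetical, not part of the project code.

```python
# Forecast horizons in hours, taken from the target list above.
HORIZONS_H = [1, 6, 12, 24, 48, 72]


def make_targets(pm25, horizons=HORIZONS_H):
    """Pair each issue time t with the measured value at t + h.

    pm25: hourly PM2.5 values (floats, or None for a missing hour).
    Returns {h: [(t, y_future), ...]}. Pairs are skipped when either the
    observation at t or the target at t + h is missing, or when t + h
    runs past the end of the series.
    """
    targets = {}
    for h in horizons:
        rows = []
        for t in range(len(pm25) - h):
            y_future = pm25[t + h]
            if pm25[t] is not None and y_future is not None:
                rows.append((t, y_future))
        targets[h] = rows
    return targets


# Synthetic placeholder series: 100 hours of made-up values.
series = [float(v) for v in range(100)]
targets = make_targets(series)
print({h: len(rows) for h, rows in targets.items()})
```

Longer horizons leave fewer usable pairs from the same record, which is worth keeping in mind when comparing error across horizons.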
Model Families
The current forecasting work is centered on tabular statistical and machine-learning models rather than full chemical transport modeling.
Core families of interest:
- persistence
- linear and autoregressive baselines
- tree ensembles such as XGBoost
- comparison against sequence models when needed
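The two simplest families in this list can be sketched in a few lines. The example below is an assumption about what "persistence" and a "linear autoregressive baseline" could look like here: persistence repeats the last observation, and the linear baseline is a direct one-lag least-squares fit of the value at t + h on the value at t. The function names are hypothetical and the series is synthetic.

```python
def persistence_forecast(pm25, t, h):
    """Forecast for hour t + h issued at hour t: repeat the value at t."""
    return pm25[t]


def fit_direct_ar1(pm25, h):
    """Least-squares fit of pm25[t + h] ~ a + b * pm25[t].

    Assumes a complete hourly series (no missing values) with some
    variation in it. Returns the intercept a and slope b.
    """
    x = pm25[:-h]
    y = pm25[h:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic series rising by 1 unit per hour, so the exact 6-hour
# relationship is y[t + 6] = 6 + 1 * y[t].
series = [float(v) for v in range(50)]
a, b = fit_direct_ar1(series, h=6)
print(persistence_forecast(series, t=10, h=6), a + b * series[10])
```

Fitting one model per horizon (a "direct" strategy) is only one option; a recursive one-step model rolled forward is the usual alternative.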
Input Design
The modeling pipeline separates input groups so they can be tested cleanly:
- PM$_{2.5}$ history
- meteorological forecast inputs
- temporal features
- atmospheric mixing features such as boundary-layer proxies or reanalysis variables
The key modeling rule is that forecast-time inputs should reflect what is actually available at prediction time.
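A minimal sketch of that rule follows. It builds one feature row for a forecast issued at hour index `t` and valid at `t + h`, reading PM$_{2.5}$ observations only at or before `t` and meteorology only from a forecast for the valid hour. The function name, the lag choices, the meteorological variable names, and the assumption that index 0 is local midnight are all illustrative, not the project's actual schema.

```python
import math


def build_features(pm25_obs, met_forecast, t, h, lags=(0, 1, 2, 23)):
    """Feature row for a forecast issued at hour index t, valid at t + h.

    pm25_obs: hourly observations; only indices <= t are read.
    met_forecast: {valid_hour_index: {variable_name: value}} holding
        forecast (not observed) meteorology for future hours.
    """
    if t < max(lags):
        raise ValueError("not enough history for the requested lags")
    valid = t + h
    row = {}
    # PM2.5 history: values known at issue time only.
    for lag in lags:
        row[f"pm25_lag{lag}"] = pm25_obs[t - lag]
    # Meteorological forecast inputs for the valid hour.
    for name, value in met_forecast[valid].items():
        row[f"fc_{name}"] = value
    # Temporal features: cyclic encoding of the valid hour of day
    # (assumes index 0 corresponds to local midnight).
    hour = valid % 24
    row["hour_sin"] = math.sin(2 * math.pi * hour / 24)
    row["hour_cos"] = math.cos(2 * math.pi * hour / 24)
    return row


# Synthetic placeholder inputs.
obs = [float(v) for v in range(48)]
met = {i: {"wind_speed": 1.0 + i, "blh": 500.0} for i in range(120)}
row = build_features(obs, met, t=30, h=24)
print(sorted(row))
```

Keeping each input group behind its own key prefix makes it easy to drop a group and retrain when testing its contribution.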
Evaluation
The main evaluation questions in this section are:
- how close are predicted concentrations to measured concentrations?
- how much better is the model than persistence?
- how does error change across horizons?
- how well does the model behave during high-PM episodes?
Primary metrics:
- RMSE
- MAE
- $R^2$
- NRMSE
- skill score vs persistence
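For reference, the metrics above can be computed as in the sketch below. Two definitions are assumptions, since the page does not fix them: NRMSE is taken as RMSE divided by the mean of the measured values (range or standard deviation are also common normalizers), and the skill score is taken as $1 - \mathrm{RMSE}_{model} / \mathrm{RMSE}_{persistence}$. The numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def nrmse(y_true, y_pred):
    # Assumption: normalize by the mean measured concentration.
    return rmse(y_true, y_pred) / (sum(y_true) / len(y_true))


def skill_vs_persistence(y_true, y_model, y_persist):
    # Assumption: RMSE-based skill. Positive means better than
    # persistence, 0 means equal, negative means worse.
    return 1.0 - rmse(y_true, y_model) / rmse(y_true, y_persist)


# Made-up measured values and forecasts, in ug/m^3.
y_true = [40.0, 60.0, 80.0, 100.0]
y_model = [45.0, 55.0, 85.0, 95.0]
y_persist = [30.0, 70.0, 70.0, 110.0]

print(rmse(y_true, y_model), mae(y_true, y_model), r2(y_true, y_model))
print(nrmse(y_true, y_model), skill_vs_persistence(y_true, y_model, y_persist))
```

Computing the same functions separately per horizon, and on the subset of hours above a chosen concentration threshold, would address the horizon and high-PM questions listed above.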
Demo Link
The operational comparison page for this section is:
That page is intended to show predicted numbers directly against measured concentration so the model can be reviewed as a forecasting tool rather than only as a metrics table.