Hanoi PM$_{2.5}$ AI-Assisted Modeling Update (2018)

$AI-assisted PM$_{2.5}$ modeling update, Hanoi 2018$

This page is intentionally separate from the 2026 forecasting experiments. It summarizes an AI-assisted reworking of an older Hanoi PM$_{2.5}$ dataset framed as a model-improvement exercise on the 2018 period rather than as the main 2026 operational forecasting study.

What This Entry Is

This graphic compares several model variants on the older Hanoi dataset and asks a narrower question:

given a fixed historical PM$_{2.5}$ dataset
how much can prediction error be reduced
by moving from simple baselines to richer lag and feature-engineering setups

That makes it a modeling update on an older dataset, not the same thing as the later multi-horizon forecasting study built around operational-style forecast inputs.

Why It Is Separate

The 2026 forecasting page is about:

explicit forecast horizons
operationally framed weather inputs
train/validation splits extending into 2025
comparison against persistence for a deployment-style use case

This 2018 entry is about:

model-improvement iteration on a historical dataset
richer lag features and blending
a tighter question of error reduction inside one data regime

Reading the Figure

The strongest result in the chart is the Blend rich lags configuration, which pushes error down substantially relative to the mean baseline and simple weather-only models.

The broad progression shown in the figure is:

mean baseline performs worst
adding weather helps
richer non-linear models help more
lag-heavy and blended models improve further
chronological testing remains the more realistic check

Model Metrics in the Figure

The plotted figure comes from the later 3.3 rework notebook, which reports all three headline metrics on a common random split, plus one chronological hold-out check.

Table ordered from worst to best RMSE (showing the progression of model improvements):

Label in figure	Split	RMSE ($\mu g/m^3$)	MAE ($\mu g/m^3$)	R$^2$
Mean baseline	random	32.0	23.3	-0.001
Linear weather	random	25.9	18.2	0.345
Random Forest	random	20.0	13.3	0.610
HGB weather	random	19.5	13.0	0.629
HGB weather+time	random	14.6	9.1	0.793
HGB rich lags chronological	chronological	14.5	8.5	0.814
HGB rich lags	random	10.8	6.8	0.883
Blend rich lags	random	10.7	6.7	0.886

Key observations:

Blend rich lags achieves the lowest error (10.7 $\mu g/m^3$), representing a 66.6% improvement over the mean baseline.
Random Forest (20.0 $\mu g/m^3$) performs slightly worse than HGB weather (19.5 $\mu g/m^3$) but significantly better than linear weather models.
The progression from simple baselines → weather-only → time features → rich lags shows clear, systematic improvement.
HGB rich lags chronological (14.5 $\mu g/m^3$) provides the more realistic forward-in-time evaluation, showing higher error than random-split validation but still a 54.7% improvement over baseline.

Random Forest in Context

The original 2018 regression notebook included a standalone RandomForestRegressor trained on the same mixed meteorology dataset. Rerunning this model with the same train/test split (33% test, random_state=2020) gives complete metrics:

Model	RMSE ($\mu g/m^3$)	MAE ($\mu g/m^3$)	R$^2$	Position
Random Forest	20.0	13.3	0.610	3rd of 8 models

This places Random Forest:

22.8% better than Linear weather (RMSE 25.9)
Slightly behind HGB weather (RMSE 19.5), but similar performance tier
46.5% worse than the best model, Blend rich lags (RMSE 10.7)
R$^2$ of 0.610 indicates it explains 61% of PM$_{2.5}$ variance using weather alone

The key insight is that Random Forest provides a solid non-linear baseline, but the major gains come from:

Adding temporal structure (HGB weather+time: RMSE 14.6, R$^2$ 0.793)
Incorporating lag features that encode recent PM$_{2.5}$ state (HGB rich lags: RMSE 10.8, R$^2$ 0.883)
Ensemble blending across multiple strong learners (Blend rich lags: RMSE 10.7, R$^2$ 0.886)

What Each X-Axis Label Means

The chart is easiest to read if each label is treated as a different modeling scope rather than just a different algorithm.

Mean baseline

Predicts the overall mean PM$_{2.5}$ concentration.
Uses no meteorology, no time structure, and no recent PM$_{2.5}$ history.
Purpose: anchor the chart with a trivial no-skill baseline.

Linear weather

Model: linear regression.
Inputs: meteorological variables only from the mixed 2018 dataset comb_PM25_wind_Hanoi_2018_v3.csv.
Data scope: the underlying notebook describes this as a mixed meteorology set rather than a live forecast product.
Purpose: show what a simple linear weather-only relationship can do.

HGB weather

Model: HistGradientBoostingRegressor.
Inputs: the same weather-only feature scope as Linear weather.
Purpose: test whether a stronger non-linear learner helps even before adding time or lagged PM$_{2.5}$ information.

HGB weather+time

Model: HistGradientBoostingRegressor.
Inputs: weather variables plus calendar structure such as hour-of-day, day-of-week, day-of-year, and month encodings.
Purpose: test whether descriptive temporal patterning improves a weather-only model.

HGB rich lags

Model: HistGradientBoostingRegressor.
Inputs: weather, time structure, and recent observed PM$_{2.5}$ lag features and rolling summaries.
Purpose: move from pure weather regression toward short-horizon forecasting / nowcasting.

Blend rich lags

Model: weighted blend of HistGradientBoosting, Random Forest, and Extra Trees.
Inputs: the same rich-lag feature set as HGB rich lags.
Purpose: see whether a small ensemble can reduce error slightly beyond a single strong learner.

HGB rich lags chronological

Model: HistGradientBoostingRegressor.
Inputs: the same rich-lag feature scope.
Evaluation difference: trained on the earlier part of 2018 and tested on the later part of 2018.
Purpose: show the more realistic forward-in-time result, rather than a notebook-style random holdout.

Why the Best Bar Sits in the Middle

The x-axis is not ordered from worst to best.

It is ordered as a modeling story:

trivial baseline
simple weather-only model
stronger weather-only model
weather plus temporal structure
rich-lag forecasting models
stricter chronological check at the end

So Blend rich lags appears near the middle-right because the figure is organized by feature scope and evaluation regime, not by ranking. The final bar is reserved for the more realistic chronological test, which is conceptually the last step in the comparison.

Recovered Result Summary

The closest surviving writeup frames this modeling pass as a test of a simple question: do strong descriptive temporal patterns in Hanoi PM$_{2.5}$ translate into useful predictive power by themselves?

The recovered answer is no.

Later notes describe the temporal signal in the underlying Hanoi series as strong in descriptive terms:

seasonal variability was the dominant pattern, with 32% coefficient of variation
diurnal variability was present but weaker, about 8.4%
weekday effects were negligible, about 1.1%

That descriptive structure looked promising, but the predictive tests collapsed when the model relied mostly on temporal structure rather than recent PM$_{2.5}$ state and weather drivers.

Main Quantitative Takeaway

The recovered benchmark table from the later paper notes is:

Model family	Features	RMSE ($\mu g/m^3$)	R$^2$	Readout
Temporal-only (basic)	8	34.2	-0.25	failed badly
Temporal-only (enhanced)	19	34.0	-0.23	failed badly
Persistence	1	14.7	n/a	strong naive baseline
Full baseline	76	12.7	0.82	best recovered reference

That is the core modeling lesson behind this figure:

temporal-only models were about 2.3x worse than persistence
they were about 2.7x worse than the fuller baseline
adding recent PM$_{2.5}$ history and weather physics mattered far more than seasonal averages or clock-time patterns

Why the Better Models Improved

The recovered notes consistently point to the same explanation:

temporal features mostly encode climatology
richer lag features encode current pollution state
weather features encode dispersion and trapping conditions
blended models perform better because they combine both recent state and physical drivers

In plain terms, a temporal-only model can say “December evenings are usually polluted”, but it cannot tell whether a specific evening is clean after rain or extreme during a stagnant inversion. The stronger models improve because they are closer to the actual state of the atmosphere.

Interpretation for the 2018 Entry

This is why the page should be read as a model-improvement note, not just a chart gallery item. The figure documents an early but useful conclusion:

descriptive patterns are real
descriptive patterns alone are not enough for forecasting
richer lag engineering and weather-aware models are where the meaningful gains came from