Clear Sky Science
Applicability analysis of tree-based ensemble learning for air pollutant prediction models
Why cleaner air needs smarter forecasts
People in big cities often wake up wondering if the air outside is safe for a jog, a commute, or letting children play outdoors. Weather apps now show air quality indexes next to temperature, but these numbers are only as good as the models behind them. This study asks a practical question with real-world stakes: which modern artificial-intelligence tools do the best job at predicting several major air pollutants at once, and why?
Tracking city air day by day
The researchers focused on four of China’s largest municipalities (Beijing, Shanghai, Tianjin and Chongqing) because they span different climates and pollution patterns, from winter smog to summer ozone. They assembled more than five thousand daily records from 2021 to 2024, each combining measurements of six key pollutants (fine particles, coarse dust particles, nitrogen dioxide, sulfur dioxide, carbon monoxide and ozone) with weather data such as temperature, humidity, wind, rainfall and air pressure. To make the most of these observations, they added extra clues: how pollution on previous days might carry over, how temperature and wind interact to disperse dirty air, and how combined measures of particles and gases might better reflect health risks.
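For readers who like to see ideas in code, the "extra clues" described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the field names (`pm25`, `temp`, `wind`), the one-day lag and the simple temperature-times-wind product are all assumptions chosen for the example.

```python
from typing import Dict, List


def add_engineered_features(
    records: List[Dict[str, float]], lag: int = 1
) -> List[Dict[str, float]]:
    """Return new rows with a lagged PM2.5 value and a temperature-wind product.

    Rows must be in chronological order. The first `lag` rows are dropped
    because they have no earlier day to look back to.
    """
    out = []
    for i in range(lag, len(records)):
        row = dict(records[i])  # copy so the input rows are left untouched
        row["pm25_lag"] = records[i - lag]["pm25"]      # carry-over from an earlier day
        row["temp_x_wind"] = row["temp"] * row["wind"]  # hypothetical interaction term
        out.append(row)
    return out


# Three made-up days of observations (values are illustrative only).
days = [
    {"pm25": 35.0, "temp": 10.0, "wind": 2.0},
    {"pm25": 50.0, "temp": 12.0, "wind": 1.0},
    {"pm25": 20.0, "temp": 15.0, "wind": 4.0},
]
features = add_engineered_features(days)
```

Each surviving row now carries yesterday's particle level alongside today's weather, which gives a model a way to pick up on pollution that lingers from one day to the next.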

Teaching digital “trees” to read the air
Instead of using traditional physics-heavy weather models, the team turned to a family of data-driven tools known as tree-based machine learning. These algorithms make decisions by repeatedly splitting data into branches, a bit like a twenty-questions game that homes in on the final answer. The study compared three versions: a simple decision tree; a random forest, which averages the results of many trees to smooth out noise; and gradient boosting, which builds trees one after another to gradually correct earlier mistakes. The scientists carefully tuned each method and used a time-aware testing strategy so that the models learned from past days and were evaluated on later ones, mirroring real forecasting conditions.
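The "one tree after another" idea behind gradient boosting, together with the time-aware split, can be illustrated with a toy sketch. It assumes a single input (temperature), one-split trees called stumps, squared-error loss and made-up data. The study itself presumably relied on full library implementations with many inputs, so treat this purely as a teaching example.

```python
from typing import Callable, List, Tuple


def fit_stump(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Find the one-split tree (threshold, left mean, right mean) with least squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value: it would leave the right side empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((yi - lm) ** 2 for yi in left) + sum((yi - rm) ** 2 for yi in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x is identical, so no split is possible
        mean = sum(y) / len(y)
        return x[0], mean, mean
    return best[1], best[2], best[3]


def boost(
    x: List[float], y: List[float], rounds: int = 20, rate: float = 0.5
) -> Callable[[float], float]:
    """Fit stumps in sequence, each one trained on the errors left by the previous ones."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # what is still unexplained
        t, lm, rm = fit_stump(x, resid)
        stumps.append((t, lm, rm))
        pred = [pi + rate * (lm if xi <= t else rm) for xi, pi in zip(x, pred)]

    def predict(v: float) -> float:
        return base + sum(rate * (lm if v <= t else rm) for t, lm, rm in stumps)

    return predict


# Made-up daily data in chronological order: ozone jumps on warm days.
temps = [5.0, 25.0, 10.0, 30.0, 8.0, 28.0, 12.0, 26.0, 9.0, 27.0]
ozone = [20.0 if t < 15 else 80.0 for t in temps]

# Time-aware split: learn from the earliest 80% of days, evaluate on the rest.
split = int(len(temps) * 0.8)
train_x, test_x = temps[:split], temps[split:]
train_y, test_y = ozone[:split], ozone[split:]

model = boost(train_x, train_y)
test_error = sum((model(v) - t) ** 2 for v, t in zip(test_x, test_y)) / len(test_y)
```

The key design choice is that the test days always come after the training days. Shuffling the rows before splitting would let the model peek at the future and make its accuracy look better than it would be in real forecasting.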
Which models shine for which pollutants
The showdown revealed that no single method is best for everything, but some standouts emerged. Random forests were exceptionally accurate for fine and coarse particles and for sulfur dioxide, explaining about 99 percent of the variation in their levels, a degree of agreement that approaches the precision of the measuring instruments themselves. For carbon monoxide and nitrogen dioxide, a form of gradient boosting nearly matched the forest’s performance, suggesting that this stepwise correction approach is well suited to traffic-related and combustion emissions that spike and fall quickly. Surprisingly, the plain decision tree, despite being the simplest tool, held its own in predicting ozone, a pollutant that forms through sunlight-driven chemistry and tends to follow threshold-like patterns the branching rules can capture.
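The phrase "explaining about 99 percent of the variation" most likely refers to the coefficient of determination, usually written R². A minimal sketch of how that score is computed follows. The observed and predicted values are invented for the example and are not results from the study.

```python
from typing import List


def r_squared(observed: List[float], predicted: List[float]) -> float:
    """Share of the variation in `observed` that the predictions account for."""
    mean_obs = sum(observed) / len(observed)
    # Error left over after using the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Error from the naive strategy of always guessing the average.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative pollutant readings and model forecasts (made-up numbers).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]
score = r_squared(observed, predicted)
```

A score of 1 means the forecasts match the measurements exactly, while a score of 0 means the model does no better than always guessing the average level.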
Peeking inside the black box
To make these powerful models useful for policy, the authors needed to show not just how well they predict, but why. They used a technique called SHAP, which assigns each input (such as temperature, wind speed or another pollutant) a contribution score for every forecast. This analysis uncovered some revealing links. Carbon monoxide emerged as a key predictor of fine particles, consistent with its role as a marker of incomplete burning that produces particle-forming vapors. Higher temperatures strongly raised predicted ozone, reflecting the way hot, sunny days supercharge its production. Humid air interacting with sulfur dioxide tended to hold particle growth in check, and stronger winds helped clear out tiny particles up to a threshold, beyond which turbulent mixing could actually trap them locally. These patterns connect the math back to real atmospheric processes, offering clues for targeted controls.
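SHAP scores are built on Shapley values, an idea from game theory for sharing credit fairly among contributors. The sketch below computes them by brute force for a tiny, invented ozone formula with two inputs. The formula, the input names and the "typical day" baseline are all assumptions made for illustration. Real SHAP tools use much faster algorithms designed for tree models, and this is not the authors' code.

```python
from itertools import permutations
from typing import Callable, Dict


def shapley_values(
    model: Callable[[Dict[str, float]], float],
    instance: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Average each input's effect over every order in which inputs can be switched on."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)  # start from the reference day
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # reveal this input's real value
            updated = model(current)
            totals[name] += updated - previous  # change in the forecast it caused
            previous = updated
    return {name: totals[name] / len(orderings) for name in names}


def toy_ozone(f: Dict[str, float]) -> float:
    """Hypothetical ozone formula: rises with heat, falls with wind, plus an interaction."""
    return 2.0 * f["temp"] - 5.0 * f["wind"] + 0.1 * f["temp"] * f["wind"]


typical = {"temp": 20.0, "wind": 4.0}  # assumed reference conditions
today = {"temp": 30.0, "wind": 2.0}    # the day being explained
contrib = shapley_values(toy_ozone, today, typical)
```

In this toy case the contributions for temperature and wind add up exactly to the gap between today's forecast and the typical-day forecast, which is the property that makes such scores easy to read as "how much each factor pushed the number up or down."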

From research code to city warning systems
Despite impressive accuracy, the authors note that the models still struggle during the most severe smog episodes and are limited by coarse descriptions of where emissions originate and by the relatively short time window of data. They propose combining traditional weather–chemistry simulations with machine learning and using the SHAP insights to design smarter emergency responses when pollution spikes. Their framework is already being used in a regional air quality warning system serving Beijing and neighboring cities. In everyday terms, the study shows that carefully chosen and well-explained artificial intelligence can give city officials earlier, more trustworthy warnings about bad air days—and clearer guidance on which sources to tackle first.
Citation: Zhu, X., Li, B., Cao, Y. et al. Applicability analysis of tree-based ensemble learning for air pollutant prediction models. Sci Rep 16, 9602 (2026). https://doi.org/10.1038/s41598-025-32652-0
Keywords: air quality forecasting, urban air pollution, machine learning models, random forest, multi-pollutant prediction