Clear Sky Science

Ensemble learning for air quality index prediction: integrating gradient boosting, XGBoost, and stacking with SHAP-based interpretability

Why Cleaner Air Needs Smarter Forecasts

Air pollution silently shapes our daily lives, from the air we breathe on the way to work to the health of our children and elders. Yet most of us only see a single air quality number on a phone app, without knowing how it is predicted or how reliable it is. This paper explores a new, smarter way to forecast the Air Quality Index (AQI) using a team of cooperating computer models. By making these forecasts both more accurate and more transparent, the work points toward timelier health warnings, better city planning, and more informed choices for everyday life.

How Dirty Air Affects People and Cities

The study starts by outlining how modern life feeds air pollution. Rapid urban growth, heavy traffic, booming construction, and the burning of fossil fuels release a mix of harmful substances into the atmosphere. Tiny particles (PM2.5 and PM10) and gases such as ozone, nitrogen oxides, sulfur dioxide, and carbon monoxide together damage lungs, strain the heart, and are linked to millions of premature deaths each year. Beyond health, polluted air harms crops, erodes buildings, lowers worker productivity, and worsens climate change. Because these impacts are wide-ranging and expensive, cities urgently need reliable forecasts that can warn people in advance, guide traffic and industry controls, and support long-term environmental policy.

Turning Complex Air Data into a Single Health Number

AQI condenses many measurements into a single scale running from clean to hazardous air. To predict this number, the researchers used a large open dataset from Taiwan: more than 4.6 million hourly records from 74 monitoring stations collected between 2016 and 2024. Each record lists levels of key pollutants, short-term averages that capture recent exposure, and weather conditions such as wind speed and direction. The team first cleaned the data, handled missing values and extreme readings, and standardized the numbers so that no single measurement dominated the others. They then set aside separate portions for training, tuning, and testing. To mimic real-world forecasting, they also checked how well the models performed on later years that had been held out of training.
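The preparation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the records, the column layout, and the split years (2021 and 2022) are invented for the example, and the article does not say which years or proportions the study used. The point it shows is that scaling statistics are computed from the training years only, so nothing about later years leaks into the model.

```python
from statistics import mean, pstdev

# Hypothetical records: (year, pm25, wind_speed, aqi). All values are invented.
records = [
    (2016, 12.0, 2.1, 45.0),
    (2017, 30.0, 1.4, 88.0),
    (2018, 22.0, 3.0, 70.0),
    (2019, 55.0, 0.8, 151.0),
    (2020, 18.0, 2.6, 62.0),
    (2021, 40.0, 1.1, 112.0),
    (2022, 9.0, 3.5, 38.0),
    (2023, 35.0, 1.9, 99.0),
    (2024, 27.0, 2.2, 81.0),
]


def chronological_split(rows, train_end, valid_end):
    """Split by year so tuning and test data lie strictly after the training data."""
    train = [r for r in rows if r[0] <= train_end]
    valid = [r for r in rows if train_end < r[0] <= valid_end]
    test = [r for r in rows if r[0] > valid_end]
    return train, valid, test


def fit_scaler(rows, feature_idx):
    """Compute (mean, standard deviation) per feature column from the given rows."""
    stats = {}
    for i in feature_idx:
        values = [r[i] for r in rows]
        sd = pstdev(values)
        stats[i] = (mean(values), sd if sd > 0 else 1.0)  # guard constant columns
    return stats


def apply_scaler(rows, stats):
    """Z-score the chosen feature columns; other columns pass through unchanged."""
    scaled = []
    for r in rows:
        row = list(r)
        for i, (mu, sd) in stats.items():
            row[i] = (row[i] - mu) / sd
        scaled.append(tuple(row))
    return scaled


# Assumed split years, chosen only for illustration.
train, valid, test = chronological_split(records, train_end=2021, valid_end=2022)

# Fit the scaler on training rows only, then reuse it for the later periods.
scaler = fit_scaler(train, feature_idx=[1, 2])
train_s, valid_s, test_s = (apply_scaler(part, scaler) for part in (train, valid, test))
```

After scaling, each feature in the training portion has mean 0 and standard deviation 1, while the later portions are expressed on the same scale without having influenced it.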

Figure 1.

Building a Team of Models Instead of Trusting Just One

Rather than relying on a single predictive formula, the authors built an “ensemble” model: a weighted voting system that combines several powerful tree-based methods. These include Gradient Boosting, XGBoost, LightGBM, and CatBoost, each of which learns patterns by building many small decision trees, with each new tree correcting the errors of the ones before it. The ensemble gives extra weight to the strongest performers (the most to Gradient Boosting, slightly less to CatBoost, and so on), much like listening more closely to the most accurate weather forecasters on a panel. The authors tuned the settings of each individual model using systematic search procedures and cross-validation so that, together, the models captured subtle nonlinear links among pollutants, weather, and AQI while avoiding overfitting to past data.
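As a rough sketch of what weighted voting means for a numeric forecast, the snippet below averages the predictions of several models using per-model weights. The predictions and the weights are hypothetical: the article says only that Gradient Boosting receives the most weight and CatBoost slightly less, so the actual values and the ordering of the remaining models are assumptions made for illustration.

```python
def weighted_vote(predictions, weights):
    """Combine per-model predictions into one forecast per sample by weighted average."""
    if set(predictions) != set(weights):
        raise ValueError("every model needs exactly one weight")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_samples = len(next(iter(predictions.values())))
    combined = []
    for i in range(n_samples):
        weighted_sum = sum(weights[name] * predictions[name][i] for name in predictions)
        combined.append(weighted_sum / total)  # normalise in case weights do not sum to 1
    return combined


# Hypothetical AQI forecasts from four already-trained models for three hours.
model_predictions = {
    "gradient_boosting": [52.0, 101.0, 148.0],
    "catboost": [50.0, 99.0, 152.0],
    "xgboost": [54.0, 103.0, 150.0],
    "lightgbm": [48.0, 97.0, 146.0],
}

# Assumed weights; the study's actual values are not given in this summary.
model_weights = {
    "gradient_boosting": 0.4,
    "catboost": 0.3,
    "xgboost": 0.2,
    "lightgbm": 0.1,
}

ensemble_forecast = weighted_vote(model_predictions, model_weights)
```

Because the result is a weighted average, each ensemble forecast always lies between the lowest and highest individual prediction for that hour, and a model with a larger weight pulls the result closer to its own value.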

Outperforming Deep Learning and Seeing Inside the Black Box

The authors compared this ensemble against a wide range of alternatives, from simple linear regression and basic decision trees to modern deep learning systems such as LSTM, CNN-LSTM, and Transformer networks. On key measures of prediction error, the ensemble consistently came out on top. It achieved very low error and explained more than 99% of the variation in AQI on unseen data, and it lost very little accuracy when tested on future periods, a sign of robustness under changing conditions. To open the “black box,” the team used two interpretability tools, partial dependence plots and SHAP values, which reveal which inputs matter most and how they influence the forecast. The results highlight fine particles (PM2.5 and its short-term average), the eight-hour ozone average, and PM10 averages as the most influential drivers of AQI. They also uncover threshold behaviors, such as a sharp jump in predicted AQI when sulfur dioxide passes a certain level, an indication that the system is learning meaningful, health-relevant patterns.
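Two of the ideas in this section can be made concrete with a small sketch: the "share of variation explained" (the R² score) and a partial dependence curve. The code below is an illustration under stated assumptions, not the study's analysis. The `toy_model` function is a hypothetical stand-in for a trained model, with a deliberate step at an invented sulfur dioxide level of 20, and all data values are made up. SHAP values are not computed here.

```python
def r_squared(y_true, y_pred):
    """Fraction of the variance in y_true that the predictions account for."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def partial_dependence(model, rows, feature, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    curve = []
    for value in grid:
        preds = [model({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a trained model, with a step when so2 exceeds 20."""
    base = 1.5 * row["pm25"] + 0.3 * row["o3_8h"]
    return base + (25.0 if row["so2"] > 20.0 else 0.0)


# Invented observations; feature names are illustrative only.
rows = [
    {"pm25": 10.0, "o3_8h": 30.0, "so2": 5.0},
    {"pm25": 35.0, "o3_8h": 50.0, "so2": 15.0},
    {"pm25": 60.0, "o3_8h": 40.0, "so2": 25.0},
]

# Sweep so2 across a grid while leaving the other inputs as observed.
so2_grid = [0.0, 10.0, 20.0, 30.0, 40.0]
so2_curve = partial_dependence(toy_model, rows, "so2", so2_grid)

# R² for a made-up set of observed and predicted AQI values.
score = r_squared([50.0, 100.0, 150.0], [51.0, 99.0, 151.0])
```

In this toy setup the partial dependence curve is flat up to the assumed threshold and then rises by a fixed amount, which is the kind of shape a threshold behavior produces in such a plot.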

Figure 2.

What This Means for Daily Life and Future Cities

For non-specialists, the key message is that air quality forecasts can be both highly accurate and understandable. By combining several complementary models and shining light on how they make decisions, this work delivers a forecasting engine that cities could plug into real-time monitoring systems. Such a tool could trigger earlier health alerts, guide school and outdoor activity planning, or support targeted traffic restrictions on days when pollution is poised to spike. Because the approach uses standard pollutant and weather measurements, it can be adapted to other regions, retrained as conditions change, and paired with new spatial methods to cover entire urban areas. In short, smarter and more transparent AQI prediction can become a practical building block for healthier, more resilient cities.

Citation: Singh, S., Kumar, M., Sengar, V. et al. Ensemble learning for air quality index prediction: integrating gradient boosting, XGBoost, and stacking with SHAP-based interpretability. Sci Rep 16, 8544 (2026). https://doi.org/10.1038/s41598-026-39232-w

Keywords: air quality index, ensemble learning, gradient boosting, pollution forecasting, model interpretability