Clear Sky Science

Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with SHAP analysis

2026-02-13

Why Watching a River’s Oxygen Matters

Rivers are the lifeblood of cities and farms, but when they fill with organic waste from factories, sewers, or fields, the water can be starved of oxygen and become unsafe for people and ecosystems. A common health check for rivers is “chemical oxygen demand” (COD), a measure of how much oxygen is needed to chemically oxidize the organic matter in a water sample. Measuring COD in the lab is slow and costly, so this study explores whether advanced but explainable machine-learning tools can reliably predict COD from routine sensor data and show clearly what is driving pollution.

Smart Models for a Polluted World

The researchers focused on two river monitoring stations in South Korea, Hwangji and Toilchun, just upstream of the multipurpose Yeongju Dam. At these stations, decades of records exist for common water-quality indicators: acidity (pH), dissolved oxygen, suspended solids (fine particles in the water), nutrients such as nitrogen and phosphorus, total organic carbon (TOC), biochemical oxygen demand (BOD₅), water temperature, electrical conductivity, and river flow. Instead of building a traditional physics-based model—which can be hard to transfer from one river to another—they tested six “boosting” algorithms, a powerful family of machine-learning methods that combine many simple decision trees into a strong predictor.
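To make "combining many simple decision trees into a strong predictor" concrete, here is a minimal sketch of gradient boosting with one-split trees (stumps) in plain Python. It is not the study's code and not any of the six libraries it tested. The single input (called toc here), the toy numbers, the learning rate, and all function names are illustrative assumptions.

```python
# Minimal gradient boosting with decision stumps (illustrative sketch only).
from statistics import mean


def fit_stump(x, residuals):
    """Find the one threshold on x that best splits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def fit_boosting(x, y, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, p in zip(x, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, xi):
    """Sum the base value and the shrunken output of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lv if xi <= t else rv) for t, lv, rv in stumps)


# Hypothetical toy data: one input (TOC-like) and a COD-like target.
toc = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
cod = [2.1, 2.9, 4.2, 5.0, 6.1, 6.8, 8.2, 9.1]


def mse(model):
    return mean((predict(model, xi) - yi) ** 2 for xi, yi in zip(toc, cod))


one_round = fit_boosting(toc, cod, n_rounds=1)
many_rounds = fit_boosting(toc, cod, n_rounds=50)
print("training MSE after 1 round:  ", mse(one_round))
print("training MSE after 50 rounds:", mse(many_rounds))
```

Each stump is a weak predictor on its own. The training error falls because every new stump is fitted to what the previous ones still get wrong. This is training error only, so it says nothing about how well the model would generalize.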

Finding the Best River “Forecaster”

To compare the six boosting methods (AdaBoost, CatBoost, XGBoost, LightGBM, HistGBRT, and NGBoost), the team trained the models on about 70% of the historical data and checked performance on the remaining 30%. They judged accuracy using several statistics that capture how close predictions are to real COD measurements and how well the models generalize to unseen conditions. At Toilchun station, the NGBoost model—which predicts not just a single value but a full probability range for COD—was the clear winner, capturing nearly all of the variation in COD with very small errors. At Hwangji, which is a more complex site, CatBoost gave the best balance of accuracy and stability. Some models, especially XGBoost, looked almost perfect on the training data but stumbled on the test data, a classic sign of “overfitting,” where a model memorizes noise rather than learning real patterns.
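The 70/30 evaluation and the overfitting symptom described above can be sketched in a few lines of plain Python. The summary does not name the statistics the team used, so R² and RMSE here are assumed choices. The synthetic data, the random seeds, and the deliberately overfit "memorizer" model are likewise hypothetical stand-ins, not the study's setup.

```python
# Sketch: 70/30 split and a train-versus-test comparison (illustrative only).
import math
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and cut it into 70% train, 30% test."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = round(0.7 * len(rows))
    return rows[:cut], rows[cut:]


# Hypothetical data: a noisy linear relation between one input and the target.
rng = random.Random(42)
rows = []
for _ in range(100):
    x = rng.uniform(0.0, 10.0)
    rows.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

train, test = split_70_30(rows)


def memorizer(x):
    """Return the target of the nearest training point (zero training error)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def evaluate(model, data):
    y_true = [y for _, y in data]
    y_pred = [model(x) for x, _ in data]
    return r2_score(y_true, y_pred), rmse(y_true, y_pred)


train_r2, train_rmse = evaluate(memorizer, train)
test_r2, test_rmse = evaluate(memorizer, test)
print(f"train: R2={train_r2:.3f} RMSE={train_rmse:.3f}")
print(f"test:  R2={test_r2:.3f} RMSE={test_rmse:.3f}")
```

A model that scores perfectly on the training rows but worse on the held-out rows has fitted the noise along with the signal. The gap between the two lines of output is the quantity to watch.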

Opening the Black Box of AI

A central aim of the study was not only to predict COD, but also to explain why the models made their predictions. For this, the authors used SHAP (SHapley Additive exPlanations), a technique that assigns each input variable a positive or negative contribution to each individual prediction. Across both stations and across most algorithms, three variables consistently emerged as the main drivers of COD: total organic carbon (TOC), biochemical oxygen demand (BOD₅), and suspended solids (SS). In simple terms, the more organic material and fine particles in the water, the higher the oxygen demand. The models also revealed site-specific differences: at Toilchun, discharge (flow) and total phosphorus played a stronger role, suggesting a bigger influence of diffuse sources such as agricultural runoff; at Hwangji, patterns in conductivity and suspended solids hinted at more localized or industrial sources.
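The idea of giving each input a signed contribution to one prediction can be illustrated by computing exact Shapley values for a tiny model. This is a sketch under stated assumptions, not the SHAP library and not the authors' pipeline. The toy formula, its coefficients, the feature values, and the "replace missing features with a baseline value" convention are all hypothetical simplifications.

```python
# Sketch: exact Shapley values by enumerating feature subsets (illustrative only).
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Average each feature's marginal effect over all subsets of the other features."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in the subset take the instance value, the rest the baseline value.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    contributions = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        contributions[name] = total
    return contributions


def toy_cod_model(f):
    """Hypothetical stand-in for a trained model, with one interaction term."""
    return 1.5 * f["toc"] + 0.8 * f["bod5"] + 0.1 * f["ss"] + 0.05 * f["toc"] * f["ss"]


baseline = {"toc": 2.0, "bod5": 1.0, "ss": 5.0}   # assumed "typical" water sample
sample = {"toc": 4.0, "bod5": 2.0, "ss": 15.0}    # assumed polluted sample

phi = shapley_values(toy_cod_model, sample, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
print("sum of contributions:", sum(phi.values()))
print("prediction minus baseline:", toy_cod_model(sample) - toy_cod_model(baseline))
```

The contributions add up to the difference between the sample's prediction and the baseline prediction, which is what makes the attribution additive. Subset enumeration grows exponentially with the number of features, so tools for real tree models rely on faster algorithms.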

What the Results Mean for Real Rivers

These insights show that boosting models, when paired with SHAP, can move beyond being opaque “black boxes.” They provide both sharp forecasts of river oxygen demand and a physically sensible story about what is driving pollution at each site. This matters for managers of dams and river basins who must prioritize what to monitor and where to intervene: if TOC and BOD₅ are the strongest levers, then controlling organic waste inputs can yield the biggest improvement in water quality. The probabilistic forecasts from NGBoost also give a sense of uncertainty, which is crucial for early-warning systems and risk-based decisions. In short, the study demonstrates that carefully designed, explainable AI can help protect drinking water reservoirs and aquatic life by turning routine sensor readings into reliable, transparent predictions of river health.
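As a rough illustration of how a probabilistic forecast could feed an early-warning rule, the sketch below turns a predicted mean and standard deviation into an interval and an exceedance probability. It assumes the predicted distribution is Gaussian. The numbers, the alert threshold, and the alert rule are hypothetical and do not come from the study.

```python
# Sketch: using a predicted distribution for a risk-based alert (illustrative only).
from statistics import NormalDist

# Hypothetical forecast for one time step: mean and standard deviation of COD (mg/L).
forecast = NormalDist(mu=6.0, sigma=1.0)

# Central 95% prediction interval from the inverse cumulative distribution.
lower = forecast.inv_cdf(0.025)
upper = forecast.inv_cdf(0.975)

# Probability that COD exceeds an assumed management threshold.
threshold = 8.0
p_exceed = 1.0 - forecast.cdf(threshold)

# Assumed rule: raise an alert when the exceedance probability passes 10%.
alert = p_exceed > 0.10

print(f"95% interval: {lower:.2f} to {upper:.2f} mg/L")
print(f"P(COD > {threshold}) = {p_exceed:.4f}, alert={alert}")
```

A point forecast of 6.0 alone would hide how likely a threshold breach is. The distribution makes that risk explicit, so the alert rule can be tuned to how costly a missed event would be.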

Citation: Merabet, K., Kim, S., Heddam, S. et al. Accurate and interpretable prediction of chemical oxygen demand using explainable boosting algorithms with SHAP analysis. Sci Rep 16, 6359 (2026). https://doi.org/10.1038/s41598-026-38757-4

Keywords: water quality, chemical oxygen demand, machine learning, river pollution, explainable AI