Clear Sky Science
Source identification of sudden water pollution events in the Dongliao River using a hybrid machine learning framework
Why sudden river pollution matters to everyone
When a factory spill or pipe break sends a pulse of contamination into a river, communities downstream may have only hours to protect drinking water intakes and ecosystems. Knowing exactly where the pollution came from, how strong it was, and how long it lasted is essential for holding the right parties accountable and responding effectively. This study focuses on China’s Dongliao River and shows how combining physics-based simulations with modern machine learning can pinpoint hidden pollution sources quickly and with a realistic sense of uncertainty, even when field data are noisy or scarce.

Following a spill down a real river
The researchers examined a nearly 30-kilometer stretch of the Dongliao River bordered by industrial parks that could cause sudden pollution incidents. They imagined emergency scenarios in which a single, brief discharge of contaminants—measured as common water quality indicators like chemical oxygen demand, ammonia, and phosphorus—enters the river from one bank. Five virtual monitoring sites were placed downstream to record how the pollution wave travels and how its peak concentration changes along the way. Because real accidents are rare and often poorly monitored, the team relied on a detailed computer model of river flow and pollutant transport to create many realistic “what if” events.
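To make the idea of a pollution wave passing a chain of monitoring sites concrete, here is a minimal sketch. It is not the detailed river model used in the study. It assumes the textbook one-dimensional advection-dispersion solution for an instantaneous release with first-order decay. The station positions, velocity `u`, dispersion coefficient `D`, cross-sectional area, decay rate `k`, and spill mass are all illustrative placeholders.

```python
import math

# Hypothetical monitoring sites, in metres downstream of the upstream boundary.
STATIONS_M = [6000.0, 12000.0, 18000.0, 24000.0, 29000.0]

def concentration(x_m, t_s, mass_g, x0_m, u=0.5, D=30.0, area=50.0, k=1e-5):
    """Concentration in g/m^3 (numerically equal to mg/L) at position x_m and
    time t_s after an instantaneous release of mass_g at x0_m.
    u (m/s), D (m^2/s), area (m^2) and k (1/s) are assumed values."""
    if t_s <= 0:
        return 0.0
    spread = math.sqrt(4.0 * math.pi * D * t_s)
    offset = x_m - x0_m - u * t_s  # distance from the plume centre
    return (mass_g / (area * spread)
            * math.exp(-offset ** 2 / (4.0 * D * t_s))
            * math.exp(-k * t_s))

def peak_at(x_m, mass_g, x0_m, dt_s=60.0, t_max_s=30 * 3600.0):
    """Scan a fixed time grid and return (peak concentration, time of peak)."""
    best_c, best_t = 0.0, 0.0
    t = dt_s
    while t <= t_max_s:
        c = concentration(x_m, t, mass_g, x0_m)
        if c > best_c:
            best_c, best_t = c, t
        t += dt_s
    return best_c, best_t

# One hypothetical spill: 1 tonne released 2 km below the upstream boundary.
peaks = [peak_at(x, mass_g=1.0e6, x0_m=2000.0) for x in STATIONS_M]
for x, (c, t) in zip(STATIONS_M, peaks):
    print(f"{x / 1000:5.1f} km  peak {c:6.2f} mg/L  after {t / 3600:5.2f} h")
```

In this toy setting the peak arrives later and lower at each successive station, which is the signal the source-identification step later works backward from.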
Turning heavy simulations into a fast stand‑in
Traditional river models solve complex equations that describe how water moves and how pollutants spread and dilute. These tools are powerful but slow: a single high‑fidelity simulation of the Dongliao reach can take about an hour, far too long for rapid emergency decisions or for exploring thousands of possible spill scenarios. To overcome this, the authors built a lightweight stand‑in model, called a surrogate, using machine learning. They generated 180 synthetic spill events with the physics-based model and used these as training data for three candidate algorithms. A neural‑network approach known as long short‑term memory (LSTM) clearly outperformed the other two, closely reproducing the original model’s predictions of peak pollution levels at all monitoring points while running almost instantly.
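The surrogate workflow (sample scenarios, run the slow model, fit a fast approximation, check it on unseen cases) can be sketched as follows. This is only an illustration under loud assumptions: the LSTM is swapped for a far simpler inverse-distance-weighted nearest-neighbour interpolator, the "slow model" is a closed-form placeholder rather than a river simulation, and the single station, parameter ranges, and physical constants are hypothetical. Only the training-set size of 180 comes from the study.

```python
import math
import random

STATION_M = 20_000.0  # hypothetical monitoring site, metres downstream

def slow_model(x0_m, mass_g, u=0.5, D=30.0, area=50.0, k=1e-5):
    """Placeholder for the hour-long simulation: approximate peak
    concentration (mg/L) at STATION_M for a spill of mass_g at x0_m."""
    t = (STATION_M - x0_m) / u  # approximate arrival time of the plume
    return mass_g / (area * math.sqrt(4 * math.pi * D * t)) * math.exp(-k * t)

def features(x0_m, mass_g):
    """Scale inputs to [0, 1]; mass is log-scaled (assumed range 1e5..1e6 g)."""
    return (x0_m / 5000.0, math.log10(mass_g) - 5.0)

def build_surrogate(n_train=180, k_neighbors=4, seed=0):
    """Run the slow model n_train times, then return a fast predictor."""
    rng = random.Random(seed)
    train = []
    for _ in range(n_train):
        x0 = rng.uniform(0.0, 5000.0)
        mass = 10 ** rng.uniform(5.0, 6.0)
        train.append((features(x0, mass), math.log(slow_model(x0, mass))))

    def predict(x0_m, mass_g):
        q = features(x0_m, mass_g)
        nearest = sorted(train, key=lambda s: math.dist(q, s[0]))[:k_neighbors]
        weights = [1.0 / (math.dist(q, f) + 1e-9) for f, _ in nearest]
        log_peak = sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
        return math.exp(log_peak)

    return predict

def mean_relative_error(predict, n_test=50, seed=1):
    """Compare surrogate and slow model on scenarios not used for training."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_test):
        x0 = rng.uniform(0.0, 5000.0)
        mass = 10 ** rng.uniform(5.0, 6.0)
        truth = slow_model(x0, mass)
        errors.append(abs(predict(x0, mass) - truth) / truth)
    return sum(errors) / len(errors)

surrogate = build_surrogate()
print(f"mean relative error on held-out spills: {mean_relative_error(surrogate):.3f}")
```

The held-out check is the important design choice: a surrogate is only useful for emergency response if its error against the slow model is known before it is trusted.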
Hunting for the hidden source
With the fast surrogate in hand, the team tackled the inverse problem: given the pollution measured downstream, can we infer where the spill happened and how strong it was? First, they used a deterministic strategy, which searches for a single best‑fit answer. Here, a nature‑inspired search method based on humpback whales’ cooperative hunting patterns—the whale optimization algorithm—tested many possible combinations of source location, strength, and duration. For each trial, the LSTM surrogate predicted downstream concentrations, which were compared with the synthetic “observations.” This whale‑LSTM pairing generally beat two other popular search methods in accuracy and speed, reducing typical errors in key source parameters to just a few percent under ideal, noise‑free data.
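The deterministic search can be illustrated with a compact version of the whale optimization algorithm (shrinking encirclement of the best solution, random exploration, and a spiral update). This is a sketch under assumptions, not the authors' implementation. The function `surrogate_peaks` is a closed-form placeholder for the trained LSTM. Only two source parameters (location and mass) are searched, with release duration omitted for brevity. The stations, bounds, and "true" spill are hypothetical.

```python
import math
import random

STATIONS_M = [6000.0, 12000.0, 18000.0, 24000.0, 29000.0]  # hypothetical sites
BOUNDS = [(0.0, 5000.0), (1.0e5, 1.0e6)]  # assumed ranges: location (m), mass (g)

def surrogate_peaks(x0_m, mass_g, u=0.5, D=30.0, area=50.0, k=1e-5):
    """Placeholder for the fast surrogate: approximate peak at each station."""
    peaks = []
    for xs in STATIONS_M:
        t = (xs - x0_m) / u
        peaks.append(mass_g / (area * math.sqrt(4 * math.pi * D * t)) * math.exp(-k * t))
    return peaks

def misfit(params, observed):
    """Sum of squared relative differences between predicted and observed peaks."""
    return sum(((p - o) / o) ** 2 for p, o in zip(surrogate_peaks(*params), observed))

def whale_optimize(cost, bounds, n_whales=30, n_iter=200, seed=0, b=1.0):
    rng = random.Random(seed)
    dim = len(bounds)
    whales = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_whales)]
    best = min(whales, key=cost)[:]
    best_cost = cost(best)
    for it in range(n_iter):
        a = 2.0 * (1.0 - it / n_iter)  # decreases linearly from 2 to 0
        for w in whales:
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # Encircle the best whale (|A| < 1) or explore around a random one.
                ref = best if abs(A) < 1.0 else rng.choice(whales)
                new = [ref[i] - A * abs(C * ref[i] - w[i]) for i in range(dim)]
            else:
                # Spiral toward the best whale.
                l = rng.uniform(-1.0, 1.0)
                new = [abs(best[i] - w[i]) * math.exp(b * l) * math.cos(2 * math.pi * l)
                       + best[i] for i in range(dim)]
            w[:] = [min(max(v, bounds[i][0]), bounds[i][1]) for i, v in enumerate(new)]
            c = cost(w)
            if c < best_cost:  # updated immediately, a small simplification
                best, best_cost = w[:], c
    return best, best_cost

# Noise-free synthetic "observations" from a hypothetical true spill.
observed = surrogate_peaks(1200.0, 8.0e5)
best, best_cost = whale_optimize(lambda p: misfit(p, observed), BOUNDS)
print(f"estimated location {best[0]:.0f} m, mass {best[1]:.3g} g, misfit {best_cost:.2e}")
```

Because every trial costs only one surrogate call, thousands of candidate sources can be tested in the time a single physics-based run would take.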

Adding uncertainty for real‑world noise
Real measurements are never perfect: instruments have errors, conditions change, and models are approximate. The researchers therefore built a second, probabilistic system that looks not for a single answer, but for a full range of plausible spill scenarios and how likely each one is. They wrapped the whale‑LSTM engine inside a Bayesian framework, which treats unknown source characteristics as variables with probability distributions. The modified algorithm lets the search occasionally accept slightly worse solutions to explore more widely, then uses statistical tools to summarize where the search spent most of its time. The result is a set of probability curves for each source parameter, such as distance from the upstream boundary or pollutant strength, along with ranges that capture the most credible values.
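The "occasionally accept a worse solution, then summarize where the search spent its time" idea is the core of Metropolis-style sampling. The sketch below uses a plain random-walk Metropolis sampler rather than the whale-based sampler described in the study, so it should be read as an illustration of the principle only. The forward model is again a closed-form placeholder for the surrogate, and the uniform prior bounds, 5% Gaussian measurement noise, step sizes, and true spill are all assumptions.

```python
import math
import random

STATIONS_M = [6000.0, 12000.0, 18000.0, 24000.0, 29000.0]  # hypothetical sites
BOUNDS = [(0.0, 5000.0), (1.0e5, 1.0e6)]  # uniform prior: location (m), mass (g)

def predicted_peaks(x0_m, mass_g, u=0.5, D=30.0, area=50.0, k=1e-5):
    """Placeholder for the fast surrogate: approximate peak at each station."""
    peaks = []
    for xs in STATIONS_M:
        t = (xs - x0_m) / u
        peaks.append(mass_g / (area * math.sqrt(4 * math.pi * D * t)) * math.exp(-k * t))
    return peaks

def log_posterior(theta, observed, rel_sigma=0.05):
    """Uniform prior inside BOUNDS plus Gaussian likelihood on relative error."""
    if any(not (lo <= v <= hi) for v, (lo, hi) in zip(theta, BOUNDS)):
        return -math.inf
    pred = predicted_peaks(*theta)
    return -0.5 * sum(((p - o) / (rel_sigma * o)) ** 2 for p, o in zip(pred, observed))

def metropolis(observed, n_steps=20000, burn_in=5000, seed=0):
    rng = random.Random(seed)
    theta = [(lo + hi) / 2 for lo, hi in BOUNDS]
    step = [0.02 * (hi - lo) for lo, hi in BOUNDS]  # assumed proposal widths
    lp = log_posterior(theta, observed)
    samples = []
    for i in range(n_steps):
        proposal = [v + rng.gauss(0.0, s) for v, s in zip(theta, step)]
        lp_new = log_posterior(proposal, observed)
        # Always accept improvements; accept worse points with probability
        # exp(lp_new - lp), which lets the chain explore.
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
        if i >= burn_in:
            samples.append(theta[:])
    return samples

def summarize(values, lo_q=0.025, hi_q=0.975):
    """Return (lower bound, median, upper bound) of a 95% credible interval."""
    s = sorted(values)
    n = len(s)
    return s[int(lo_q * (n - 1))], s[n // 2], s[int(hi_q * (n - 1))]

# Synthetic observations from a hypothetical true spill, with 5% sensor noise.
noise = random.Random(42)
observed = [p * (1.0 + noise.gauss(0.0, 0.05)) for p in predicted_peaks(1200.0, 8.0e5)]
samples = metropolis(observed)
for name, i in (("location (m)", 0), ("mass (g)", 1)):
    lo, med, hi = summarize([s[i] for s in samples])
    print(f"{name}: median {med:.4g}, 95% interval [{lo:.4g}, {hi:.4g}]")
```

The width of each interval is what the deterministic search cannot provide: it shows how much the noisy data actually constrain each source parameter.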
What this means for protecting rivers
When the team introduced measurement noise similar to what field sensors might experience, the limits of the deterministic approach became clear: some parameters drifted far from their true values. The probabilistic method, by contrast, remained stable, typically keeping errors below 7% for most release characteristics and providing clear uncertainty ranges around each estimate. Crucially, the entire probabilistic analysis for a spill can be completed in a few minutes on ordinary hardware. For emergency managers, this means they can rapidly infer where a sudden pollutant pulse likely came from and how severe it was, while also seeing how confident those inferences are. The framework offers a practical path toward intelligent early‑warning systems that blend physics, data, and probability to safeguard surface waters.
Citation: Wang, Y., Wang, Y., Shi, P. et al. Source identification of sudden water pollution events in the Dongliao River using a hybrid machine learning framework. Sci Rep 16, 11976 (2026). https://doi.org/10.1038/s41598-026-41724-8
Keywords: river pollution, source identification, machine learning, Bayesian inversion, water quality monitoring