A hybrid simulation-machine learning proxy model for waterflood design optimization in the Bahariya Formation
Why smarter water use in oil fields matters
Getting the last drops of oil out of a reservoir usually means pushing water into the rock to sweep oil toward producing wells. This process, called waterflooding, is widespread but far from perfect: too much water can be injected in the wrong places, leaving valuable oil behind and creating large volumes of waste water. This study shows how combining classic physics-based simulations with modern machine learning can help engineers design smarter waterfloods in a particularly complex Egyptian oil field, recovering more oil with fewer trial‑and‑error runs on a computer.
A tricky underground puzzle
The Bahariya Formation in Egypt’s Western Desert is not a tidy, uniform sponge of rock. Instead, it is built from ancient river and delta deposits, with layers of sandstone and shale stitched together in irregular patterns. This patchwork creates channels where fluids can move easily and dead ends where oil gets trapped. Data from wells are limited, making it hard to describe this underground maze in detail. Traditional reservoir simulators can model such systems, but doing so thoroughly requires thousands of slow, computer‑intensive runs—too many for day‑to‑day decision making in the field.

Blending physics models with data‑driven learning
The authors built a detailed three‑dimensional computer model of the reservoir using geological and petrophysical information such as rock quality, pore space, and fluid properties. They then designed a large set of “what‑if” scenarios—1,536 in total—by varying key factors like how fast water is injected, how easily water flows through the rock, how much oil tends to remain trapped, and how light or heavy the oil is. For each scenario they tested three standard layouts of injection and production wells: a five‑spot grid in which injectors sit among surrounding producers, a staggered line drive pattern, and a peripheral scheme that injects water mostly around the field edges. The simulator reported how much oil could ultimately be recovered in each case.
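One common way to lay out such a set of "what-if" runs is a full-factorial grid, sketched below. The source does not state the sampling scheme or the values used, so the factor names, the three levels per factor, and the resulting run count are all hypothetical and do not reproduce the study's 1,536 scenarios.

```python
from itertools import product

# Hypothetical levels for the four varied factors. The names and values are
# illustrative assumptions only; they are not taken from the study.
levels = {
    "injection_rate": [500, 1000, 2000],           # assumed units: bbl/day
    "water_mobility": [0.2, 0.3, 0.4],             # assumed dimensionless index
    "residual_oil_saturation": [0.20, 0.25, 0.30],
    "oil_api_gravity": [25, 32, 40],
}
patterns = ["five_spot", "staggered_line_drive", "peripheral"]


def build_scenarios(levels, patterns):
    """Return one dict per (factor combination, well pattern) pair."""
    names = list(levels)
    scenarios = []
    for combo in product(*(levels[name] for name in names)):
        for pattern in patterns:
            row = dict(zip(names, combo))
            row["pattern"] = pattern
            scenarios.append(row)
    return scenarios


scenarios = build_scenarios(levels, patterns)
print(len(scenarios), "simulation runs to queue")
```

Each dictionary would then be handed to the reservoir simulator as one run; the grid grows multiplicatively with every added level, which is why the run count climbs so quickly.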
Teaching a fast proxy to mimic the simulator
Instead of relying forever on slow simulations, the team trained simple machine learning models—specifically linear regression—to learn the link between the varied inputs and the final oil recovery for each well pattern. These models act as “proxies”: once trained, they can predict recovery in a fraction of a second. The researchers carefully split the data into training and testing sets and checked performance across multiple splits. For all three patterns, the models reproduced the simulator’s results with high fidelity, explaining more than 93 percent of the variability in recovery and keeping prediction errors very small. In effect, the heavy physics‑based simulator was distilled into lightweight equations that still behave in a physically sensible way.
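The proxy idea can be sketched in a few lines: fit an ordinary least-squares model to simulator results, then score it on runs it has not seen, repeating over several random splits. The data here are synthetic stand-ins generated from made-up coefficients, not the study's simulator output, and the 80/20 split ratio and the number of splits are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for simulator output for one well pattern: four inputs
# scaled to [0, 1] and a recovery factor built from made-up sensitivities.
n_runs = 512
X = rng.uniform(0.0, 1.0, size=(n_runs, 4))
assumed_coef = np.array([0.03, -0.04, -0.15, 0.06])  # hypothetical values
y = 0.45 + X @ assumed_coef + rng.normal(0.0, 0.01, size=n_runs)


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    return beta[0] + X @ beta[1:]


def r2_score(y_true, y_pred):
    """Fraction of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Repeat a random 80/20 train/test split to check that the score is stable.
scores = []
for seed in range(5):
    idx = np.random.default_rng(seed).permutation(n_runs)
    n_test = n_runs // 5
    test, train = idx[:n_test], idx[n_test:]
    beta = fit_linear(X[train], y[train])
    scores.append(r2_score(y[test], predict(beta, X[test])))

print("held-out R^2 per split:", np.round(scores, 3))
```

Once fitted, the proxy is just the coefficient vector `beta`, so evaluating a new scenario costs one dot product instead of a full simulation.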

What really controls how much oil comes out
With these fast proxy models in hand, the authors probed which factors matter most. By scrambling each input in turn and measuring how much the predictions worsened, they found that the amount of oil left trapped in the rock after flooding, called residual oil saturation, was by far the dominant control, accounting for about 40 percent or more of the predictive power across all patterns. Oil quality also mattered: lighter oils flowed more readily and boosted recovery, while rocks that allowed water to move too easily tended to cause early water “breakthrough” and reduced sweep. Interestingly, the importance of injection rate depended strongly on the pattern. In the peripheral scheme, how hard water was pushed in had a much larger impact than in the grid‑like five‑spot layout, revealing that different well arrangements respond to different levers.
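The scrambling test described above reads like what is usually called permutation importance; that label is an assumption, since the summary does not name the method. A minimal sketch follows, again on synthetic data with hypothetical feature names and coefficients, so the printed shares illustrate the mechanics and not the study's findings.

```python
import numpy as np

rng = np.random.default_rng(0)
features = ["injection_rate", "water_mobility",
            "residual_oil_saturation", "oil_api_gravity"]  # hypothetical names

# Synthetic stand-in data; the coefficients are made up for illustration.
n_runs = 512
X = rng.uniform(0.0, 1.0, size=(n_runs, 4))
y = 0.45 + X @ np.array([0.03, -0.04, -0.15, 0.06])
y = y + rng.normal(0.0, 0.01, size=n_runs)

# Fit the linear proxy once (ordinary least squares with an intercept).
A = np.column_stack([np.ones(n_runs), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)


def mse(X_eval):
    """Mean squared error of the fitted proxy on the given inputs."""
    pred = beta[0] + X_eval @ beta[1:]
    return float(np.mean((y - pred) ** 2))


def permutation_importance(X, n_repeats=20):
    """Share of the total error increase caused by shuffling each column."""
    baseline = mse(X)
    increases = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this input and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases[j] += mse(X_perm) - baseline
    increases /= n_repeats
    return increases / increases.sum()


shares = permutation_importance(X)
for name, share in zip(features, shares):
    print(f"{name:26s} {share:.2f}")
```

For brevity this sketch scores the shuffled inputs on the same runs used for fitting; scoring on held-out runs is the more cautious choice when the proxy might overfit.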
From computer insight to field decisions
Putting these pieces together, the study shows that there is no universally best waterflood pattern. In the simulated Bahariya field, the peripheral layout delivered the highest ultimate recovery, followed by the staggered line drive and then the five‑spot pattern. But this ranking emerged only after accounting for the specific rock fabric and fluid properties of this reservoir. The hybrid workflow—using simulations to generate data, then machine learning to analyze it quickly—offers field engineers a practical way to screen many waterflood options, understand which parameters truly matter, and tune injection strategies without running thousands of full simulations. For a lay reader, the takeaway is simple: by letting computers both obey the laws of physics and learn from data, operators can get more oil from the same rocks while using water and computing resources more efficiently.
Citation: Gad, R., Salem, A.M., El Farouk, O.M. et al. A hybrid simulation-machine learning proxy model for waterflood design optimization in the Bahariya Formation. Sci Rep 16, 14023 (2026). https://doi.org/10.1038/s41598-026-49561-5
Keywords: waterflooding, machine learning, oil recovery, reservoir simulation, Bahariya Formation