Clear Sky Science
Soil microbiome prediction using traditional machine learning and deep learning models
Why the Tiny Life in Soil Matters
Every teaspoon of soil holds a teeming world of bacteria and fungi that quietly power our food production, store carbon, and recycle nutrients. Yet we still struggle to predict which microbes will live where, or how they will respond as climate and land use change. This study asks a practical question: can modern computer models, trained on basic environmental measurements like temperature, rainfall, and soil chemistry, reliably forecast the makeup of these hidden communities?

Big Data Meets the Underground World
The researchers focused on the soil microbiome, the vast community of microorganisms living in dirt, and treated it as a system that might be predictable from its surroundings. Using two large public datasets from global soil surveys and from the U.S. National Ecological Observatory Network (NEON), they assembled information on bacterial and fungal communities alongside measurements such as soil pH, carbon and nitrogen content, climate, and vegetation. Instead of tracking every single species, they grouped microbes into broader categories: taxonomic levels like phylum, class, order, family, and genus, and functional groups that describe what microbes do, such as cycling carbon or nitrogen.
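The grouping step described above can be illustrated with a minimal sketch. It assumes each sample arrives as read counts keyed by lineage; the taxon names, counts, and the `relative_abundance` helper are hypothetical placeholders, not the study's data or pipeline.

```python
from collections import defaultdict

# Hypothetical read counts for one soil sample, keyed by (phylum, genus).
# Names and numbers are placeholders, not data from the study.
SAMPLE_COUNTS = {
    ("phylum_A", "genus_a1"): 30,
    ("phylum_A", "genus_a2"): 20,
    ("phylum_B", "genus_b1"): 40,
    ("phylum_B", "genus_b2"): 10,
}

# Position of each rank inside the lineage tuple (only two ranks, for brevity).
RANK_INDEX = {"phylum": 0, "genus": 1}


def relative_abundance(counts, rank):
    """Sum counts per taxon at the given rank, then scale to fractions summing to 1."""
    idx = RANK_INDEX[rank]
    totals = defaultdict(int)
    for lineage, n in counts.items():
        totals[lineage[idx]] += n
    grand_total = sum(totals.values())
    return {taxon: n / grand_total for taxon, n in totals.items()}


phylum_profile = relative_abundance(SAMPLE_COUNTS, "phylum")
genus_profile = relative_abundance(SAMPLE_COUNTS, "genus")
print(phylum_profile)
print(genus_profile)
```

The same sample yields two prediction targets: a coarse profile with two phyla and a finer one with four genera. A model would be asked to predict one such vector per sample at each level.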
Testing Different Ways to Learn from Data
To turn environmental measurements into predictions of community composition, the team compared seven computer modeling approaches. Six were widely used “traditional” machine learning methods, including linear regression, decision trees, random forests, gradient boosting, and k-nearest neighbors. The seventh was a deep learning model called a multilayer perceptron, a type of neural network. For each dataset and each taxonomic or functional level, the models were trained on most of the samples and then asked to predict the relative abundances of microbial groups in new, unseen soil samples. The accuracy of these predictions was measured with the coefficient of determination (R²), a standard statistic that reflects how much of the observed variation a model can explain.
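The train-then-evaluate procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, the two features are hypothetical environmental variables already scaled to the range 0 to 1, the 80/20 split and k = 5 are arbitrary choices, and a hand-written k-nearest-neighbors regressor stands in for the library models used in the study.

```python
import math
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training samples closest to the query point."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Synthetic stand-in data (an assumption): two scaled environmental features
# per sample and one microbial group's relative abundance as the target.
rng = random.Random(0)
features = [(rng.random(), rng.random()) for _ in range(300)]
abundance = [0.2 + 0.5 * a - 0.1 * b + rng.gauss(0.0, 0.02) for a, b in features]

# Hold out 20% of the samples as "new, unseen" soil (split ratio is an assumption).
indices = list(range(len(features)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

train_x = [features[i] for i in train_idx]
train_y = [abundance[i] for i in train_idx]
test_x = [features[i] for i in test_idx]
test_y = [abundance[i] for i in test_idx]

# Predict each held-out sample from the training set only, then score with R².
predictions = [knn_predict(train_x, train_y, x) for x in test_x]
test_r2 = r_squared(test_y, predictions)
print(f"held-out R^2: {test_r2:.3f}")
```

An R² of 1 means the predictions match the held-out values exactly, while 0 means the model does no better than always predicting the mean abundance. Scoring only on held-out samples is what makes the number a measure of prediction rather than of fit.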

Patterns Across Scales in the Soil Community
A clear pattern emerged: it is easier to predict broad groupings of microbes than to predict fine detail. For both bacteria and fungi, models generally performed best at higher taxonomic levels, such as phylum and class, and became less accurate as they tried to distinguish smaller categories like family and genus. This suggests that while the exact mix of closely related microbes can be hard to foresee, the overall structure of the community is more tightly linked to the environment. An exception appeared for bacterial functional groups in one dataset, where none of the models captured the patterns well, likely because the chosen functional categories did not fully reflect the true complexity of microbial roles.
Which Models Worked Best and Why
Among all approaches tested, two traditional methods, random forest and k-nearest neighbors, consistently gave the strongest predictions. Random forests excelled at broader taxonomic levels, while k-nearest neighbors was especially effective at the more detailed family and genus levels. Gradient boosting sometimes matched or exceeded these models, particularly for fungal functional groups, but its performance varied more from one level to another. Surprisingly, the deep learning neural network rarely outperformed these simpler methods. The authors argue that this is largely because deep learning typically requires far more training data than the few hundred to a couple of thousand soil samples available here. Overall, bacterial communities were easier to predict than fungal ones, and datasets with more samples yielded better results.
What This Means for Soil Stewardship
The study shows that, even with today’s imperfect data, machine learning can already provide reasonably good forecasts of soil microbial communities at broad levels. That is encouraging for efforts to manage soils for agriculture, restoration, and climate mitigation, because it suggests we can use relatively simple environmental measurements to anticipate big-picture shifts in the underground world. At the same time, the difficulty of predicting fine-scale details and certain functional groups highlights how much we still do not know about soil organisms and their roles. Better, larger datasets and richer descriptions of microbial functions will be needed before deep learning and other advanced tools can reach their full potential in guiding how we care for the living soil beneath our feet.
Citation: Aouabed, Z., Therrien, V., Bouaoune, M.A. et al. Soil microbiome prediction using traditional machine learning and deep learning models. Sci Rep 16, 11069 (2026). https://doi.org/10.1038/s41598-026-39537-w
Keywords: soil microbiome, machine learning, bacteria and fungi, environmental gradients, community prediction