Clear Sky Science
Multidisciplinary prediction of running-related injuries using machine learning
Why this matters to runners
Endurance running is one of the most popular ways to stay fit, but nearly half of regular runners will suffer a significant injury each year. These problems can derail training, hurt quality of life, and rack up medical bills. This study asks a practical question with cutting-edge tools: can we combine information about a runner’s body, lifestyle, and training into a computer model that warns when they are heading toward injury, before it actually happens?

Looking at the whole runner, not just the shoes
Most past studies tried to link running injuries to just one thing at a time—such as training volume, shoe type, or a single strength measure. But real-world injuries usually arise from a tangle of influences: genetics, past injuries, muscle strength, movement patterns, body build, diet, and how training loads change over time. In this work, researchers assembled a rare, richly detailed picture of 142 competitive endurance runners, aged 14 to 50, followed for a full year. For each runner they collected laboratory measures of bone and muscle, motion analysis of running style, strength tests, body composition scans, nutrition data, genetic markers related to tissue health, and detailed weekly reports of training and injuries. In total, this produced more than six thousand weekly snapshots linking what the runner was like and what they were doing to whether they developed a running-related problem.
Teaching computers to spot injury risk
With this dataset in hand, the team trained several types of machine learning models to predict whether a runner would report a new running-related injury in a given week. Some models were simple and easy to interpret, like logistic regression, while others were more flexible but more opaque, such as random forests, boosting methods, support vector machines, and neural networks. The researchers built two main versions of the prediction task. One used only risk factors with strong prior scientific support, such as sex, age, past injury days, certain strength and alignment measures, key training load metrics, and selected gene variants. The other added a much broader set of exploratory factors with weaker evidence behind them, to test whether model performance improved when more information was provided.
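The setup described above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the authors' code: the feature counts, labels, and model settings are invented, and the only point is the shape of the task, one row per runner-week, one binary injury label, and two feature sets of different breadth.

```python
# Hypothetical sketch of the weekly injury-prediction task.
# All data and feature names here are synthetic, for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_weeks = 6000  # roughly the number of weekly snapshots in the study

# Curated, evidence-backed features (stand-ins for e.g. age, past injury
# days, strength measures, training load metrics).
curated = rng.normal(size=(n_weeks, 6))
# Broader pool of exploratory, weaker-evidence variables.
exploratory = rng.normal(size=(n_weeks, 20))
full = np.hstack([curated, exploratory])

# Synthetic weekly injury labels, driven here by the curated features only.
logits = curated @ rng.normal(size=6) - 2.0
y = (rng.random(n_weeks) < 1 / (1 + np.exp(-logits))).astype(int)

# Train a simple and a flexible model on each feature set, as in the study.
for name, X in [("curated", curated), ("full", full)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    for model in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=200, random_state=0)):
        model.fit(X_tr, y_tr)
        print(name, type(model).__name__,
              round(model.score(X_te, y_te), 3))
```

On real data, the interesting comparison is exactly the one the paper makes: whether the extra exploratory columns change each model's held-out performance.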

What the models could and could not do
The best-performing approach was an ensemble method called a random forest, which reached a moderate level of accuracy (area under the curve around 0.78) when predicting weekly injury risk. This slightly surpasses results from earlier studies that relied on training data alone in runners, and is comparable to the better results reported for mixed track and field athletes. Interestingly, most models did not benefit from simply adding more, weaker-evidence variables: their accuracy stayed about the same whether they used the carefully curated list or the full, larger feature set. A notable exception was logistic regression, a relatively simple method, which improved markedly when given the broader pool of variables, moving from near the bottom to among the better performers. By contrast, probabilistic models that assume the input variables are independent of one another performed poorly, likely because many of the risk factors are correlated or interact in complex ways.
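"Area under the curve" has a concrete reading worth spelling out: it is the probability that the model assigns a higher risk score to a randomly chosen injury week than to a randomly chosen injury-free week, so 0.5 is coin-flip and 1.0 is perfect ranking. The short sketch below, using invented scores and labels (not the study's data), checks that scikit-learn's AUC matches that pairwise definition.

```python
# AUC as a pairwise ranking probability, on synthetic scores and labels.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)          # 1 = injury week, 0 = no injury
# Risk scores that are informative but noisy, so AUC lands above chance.
scores = y_true * 1.0 + rng.normal(scale=1.2, size=1000)

auc = roc_auc_score(y_true, scores)

# Pairwise interpretation, computed directly: fraction of (injury,
# non-injury) week pairs where the injury week gets the higher score.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()
print(round(auc, 3), round(pairwise, 3))  # the two values agree
```

Seen this way, an AUC of about 0.78 means the model correctly ranks an injured week above an uninjured one roughly four times out of five, which is useful but well short of the reliability a clinical tool would need.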
Limits today, potential for tools tomorrow
Despite the careful design, the models are not yet accurate enough for clinical use or for making firm training decisions. One major reason is scale: 142 runners and just over 6000 weekly samples are small numbers for such a complex problem, especially when considering the wide variety in age, competition level, preferred distances, and running surfaces. The study also relied on self-reported injuries and some infrequent measures, such as occasional food diaries, which may blur important short-term changes. In addition, the models were only tested within this single group of runners, so it is unclear how well they would generalize to new populations. The authors suggest that larger, pooled datasets, combined with data streams from wearables and automated diet or sleep tracking, could provide the richer, more frequent information that machine learning models need to deliver stronger, more reliable predictions.
What this means for everyday runners
For now, this research does not produce a ready-made app that tells you exactly when you will get hurt. Instead, it offers a blueprint and a public dataset that other scientists can build on. It shows that computers can learn meaningful patterns from a broad, realistic mix of genetic, physical, and training information, but also that predicting running injuries is inherently difficult. As future studies add more runners, better sensors, and deeper analysis, this line of work could eventually power decision-support tools that give runners personalized guidance on how hard to train, when to ease off, and which modifiable factors—such as strength or nutrition—deserve extra attention to keep them running pain-free.
Citation: Wu, H., Brooke-Wavell, K., Barnes, M.R. et al. Multidisciplinary prediction of running-related injuries using machine learning. npj Digit. Med. 9, 213 (2026). https://doi.org/10.1038/s41746-026-02413-y
Keywords: running injuries, machine learning, sports medicine, injury prediction, endurance running