Clear Sky Science

Imputation methods for serologic biomarkers in inflammatory bowel disease

2026-02-26

Why this research matters for patients and doctors

Blood tests that measure antibodies in people with inflammatory bowel disease (IBD) are increasingly used to help diagnose the condition, tell Crohn’s disease from ulcerative colitis, and even hint at how the illness may unfold. But in the real world, many of these blood measurements are missing because samples are hard to collect and patients are hard to follow over time. This study asks a deceptively simple question with big consequences: when key pieces of those blood-test puzzles are missing, what is the best way to fill in the gaps so that doctors and researchers can still trust their results?

Hidden holes in blood-test data

IBD, which includes Crohn’s disease and ulcerative colitis, is marked by chronic inflammation in the digestive tract. Certain antibodies in the blood (directed against yeast, bacteria, and other targets) have become powerful clues for spotting IBD, distinguishing its subtypes, and sometimes predicting disease years before symptoms appear. However, assembling large serology datasets from thousands of patients is messy. Samples may be misplaced, some tests may fail, or patients may skip visits. Traditional quick fixes, such as throwing away any patient with a missing value, waste information and can skew results, making diseases look more or less strongly associated with certain markers than they really are.

Different ways that data can go missing

The authors first carefully recreated the main ways that blood-test values might be absent. In one scenario, values disappear completely at random, like coin flips across a data table (statisticians call this "missing completely at random"). In another, missingness depends on other information we do see; for example, people with milder disease may be less likely to have certain tests done ("missing at random"). In the most difficult scenario, the missingness depends on the very value we do not observe; for instance, extremely high or low antibody levels are less likely to be recorded ("missing not at random"). Using three large IBD cohorts, the team generated thousands of versions of their datasets with varying amounts of missing information, from 5% up to 40% of blood-test entries left blank.
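
To make the three scenarios concrete, here is a minimal Python sketch that blanks out entries of a single antibody column in each of the three ways. It is an illustration only, not the authors' simulation code: the function names, the "severity" covariate, the synthetic data, and the specific probability formulas are all assumptions chosen so that the average share of blanks stays close to the requested rate.

```python
import random

def percentile_ranks(values):
    """Map each value to its rank position in (0, 1); ties broken by order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = (pos + 0.5) / len(values)
    return ranks

def missingness_masks(antibody, severity, rate=0.2, seed=0):
    """Return masks (True = blank) for one antibody column under three mechanisms.

    Assumes rate <= 0.5 so every probability below stays within [0, 1].
    """
    rng = random.Random(seed)
    n = len(antibody)
    sev_rank = percentile_ranks(severity)  # observed covariate (hypothetical)
    ab_rank = percentile_ranks(antibody)   # the value that may go missing

    # Completely at random: the same chance for every entry.
    mcar = [rng.random() < rate for _ in range(n)]
    # At random: chance depends on an observed covariate (milder disease, more gaps).
    mar = [rng.random() < 2 * rate * (1 - s) for s in sev_rank]
    # Not at random: chance depends on the hidden value itself (extremes go missing).
    mnar = [rng.random() < 4 * rate * abs(a - 0.5) for a in ab_rank]
    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}

# Synthetic stand-in data; real cohorts would supply these columns.
data_rng = random.Random(1)
severity = [data_rng.random() for _ in range(5000)]
antibody = [data_rng.gauss(0, 1) for _ in range(5000)]
masks = missingness_masks(antibody, severity, rate=0.2)
for name, mask in masks.items():
    print(name, "share blank:", round(sum(mask) / len(mask), 3))
```

All three masks blank out a similar share of entries, but only in the first case are the remaining values a fair sample of the original column.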

Modern tools to fill in the blanks

They then compared families of computer methods for filling in the gaps—an approach known as imputation. Some methods, such as MICE (Multiple Imputation by Chained Equations) and related "iterative imputers," repeatedly predict each missing value from the others, cycling until the whole table is filled. Others use more flexible machine learning engines, including random forests, nearest-neighbor methods that borrow information from similar patients, and deep-learning models called autoencoders and variational autoencoders that learn compressed summaries of the data and reconstruct missing pieces from those summaries. For each setup, the researchers created multiple completed datasets to capture uncertainty and evaluated performance from three angles: how close the filled-in numbers were to the originals, how well standard statistical tests recovered known disease–antibody links, and how accurately predictive models could distinguish IBD subtypes.
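
As a rough illustration of the nearest-neighbor idea and of the first evaluation angle (closeness to the original values), the sketch below fills each blank with the average from the most similar patients and then scores the result against the known truth. It is a simplified stand-in written for this summary, not the implementation used in the study; the function names, the toy table, and the choice of distance are assumptions.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the column mean of the k most similar rows.

    Similarity is a root-mean-square distance over columns observed in both rows.
    Entries with no usable neighbor are left as None.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(rows):
                # A neighbor must be a different row with column j observed.
                if other_idx == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def rmse_on_masked(truth, masked, filled):
    """Root-mean-square error over entries that were blanked and then filled."""
    errors = [(t - f) ** 2
              for t_row, m_row, f_row in zip(truth, masked, filled)
              for t, m, f in zip(t_row, m_row, f_row)
              if m is None and f is not None]
    return math.sqrt(sum(errors) / len(errors))

# Toy table: two antibody levels per patient, two visible patient groups.
truth = [[1.0, 10.0], [1.1, 10.0], [0.9, 10.0],
         [5.0, 50.0], [5.1, 50.0], [4.9, 50.0]]
masked = [list(r) for r in truth]
masked[0][1] = None  # hide one value in each group
masked[3][1] = None
filled = knn_impute(masked, k=2)
print(filled[0][1], filled[3][1], rmse_on_masked(truth, masked, filled))
```

A chained-equations imputer follows the same fill-and-score pattern but replaces the neighbor average with a regression model that is refitted on each pass through the columns.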

What works best under different conditions
Figure 1.

No single method emerged as a universal champion. When only a small slice of data was missing and the gaps were fairly well behaved, iterative methods, especially those built on Bayesian regression, random forests, or nearest neighbors, tended to give the most accurate reconstructions and preserved the strength of associations seen in the full data. As more values vanished, especially under tougher missingness patterns, deep-learning approaches based on autoencoders became increasingly attractive. These models were better at preserving the overall structure of the data and keeping prediction performance close to what would have been obtained with complete information. Across the board, simply discarding incomplete cases performed worse than imputation: it weakened signals, reduced statistical power, and offered no advantage in controlling false positives.
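
A quick back-of-the-envelope calculation shows why discarding incomplete cases costs so much statistical power. The sketch below assumes that gaps fall independently in each marker at the same rate (the simplest, completely-at-random case) and uses a hypothetical panel of six antibody markers; neither assumption comes from the study.

```python
def complete_case_fraction(missing_rate, n_markers):
    """Expected share of patients with no blank in any marker,
    assuming independent gaps with the same rate in every marker."""
    return (1 - missing_rate) ** n_markers

# Hypothetical six-marker panel across the range of missingness studied.
for rate in (0.05, 0.20, 0.40):
    kept = complete_case_fraction(rate, n_markers=6)
    print(f"{rate:.0%} of entries blank: about {kept:.1%} of patients remain")
```

Under these assumptions the share of usable patients shrinks much faster than the share of blank entries grows, which is why filling in values retains far more information than deleting rows.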

Choosing the right tool for the job
Figure 2.

The study’s bottom line is practical rather than prescriptive. For projects where the priority is sound statistical inference—such as estimating how strongly a specific antibody is linked to Crohn’s disease—methods that follow multiple-imputation principles, like MICE and certain iterative imputers, are a sensible first choice. They pair well with established rules for combining results across imputed datasets and provide well-calibrated uncertainty estimates. In contrast, when the main goal is prediction—such as training a machine learning model to classify patients—iterative imputers and autoencoder-based approaches often shine, particularly when the share of missing values is high. By showing that different methods excel under different missingness levels and analysis goals, this work offers a roadmap for researchers to select imputation strategies that preserve both the scientific signal and the clinical usefulness of serologic data in IBD.
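
The "established rules for combining results" are commonly known as Rubin's rules. The sketch below shows the basic idea for a single estimate, such as a regression coefficient linking an antibody to Crohn’s disease: average the estimates from each completed dataset, then add the spread between datasets to the usual uncertainty within each one. The function name and the example numbers are hypothetical, and the degrees-of-freedom adjustment used for confidence intervals is omitted for brevity.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one estimate across m imputed datasets (needs m >= 2).

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # spread across imputations
    total = within + (1 + 1 / m) * between       # total variance of the pooled estimate
    return pooled, math.sqrt(total)

# Made-up coefficients from three imputed datasets, for illustration only.
estimate, std_error = pool_rubin([0.5, 0.6, 0.7], [0.1, 0.1, 0.1])
print(round(estimate, 3), round(std_error, 3))
```

The pooled standard error is larger than any single dataset's standard error whenever the imputed datasets disagree, which is how the extra uncertainty from the missing values is carried into the final result.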

What this means in plain terms

For people living with IBD and the clinicians and scientists who care for them, the message is reassuring but nuanced: even when blood-test records are riddled with gaps, carefully chosen computational methods can reconstruct enough of the picture to keep analyses reliable. There is no one-size-fits-all solution, but there are clear patterns—simpler iterative methods work well when data are mostly complete, while more flexible deep-learning tools are better when the holes are larger and more complicated. Using these approaches instead of discarding imperfect data helps protect against misleading conclusions and supports more accurate diagnosis, disease monitoring, and treatment research built on serologic biomarkers.

Citation: Boodaghidizaji, M., McGovern, D.P.B. & Li, D. Imputation methods for serologic biomarkers in inflammatory bowel disease. Sci Rep 16, 11160 (2026). https://doi.org/10.1038/s41598-026-41587-z

Keywords: inflammatory bowel disease, serologic biomarkers, missing data, multiple imputation, machine learning