Clear Sky Science

Conformal selective prediction with cost aware deferral for safe clinical triage under distribution shift

2026-02-20

Why this matters for patients and clinicians

When someone in intensive care begins to slide toward sepsis, every hour can mean the difference between life and death. Hospitals are turning to artificial intelligence (AI) to flag these high‑risk patients early, but most systems still behave like overconfident oracles: they always give an answer, even when they are unsure or faced with new kinds of cases. This paper explores a different approach—an AI assistant that knows when to speak up and when to hand a case back to human clinicians, with the explicit goal of keeping patients safe even as hospital conditions change over time.

A smarter way to say “I’m not sure”

The authors build a triage framework for early sepsis prediction that does not force the model to decide on every patient. Instead, for each patient the system either makes a prediction or defers the case to a clinician. The key idea is to treat this as a cost problem: missing a true sepsis case is far worse than sounding an extra alarm or asking for a human review. The model is trained on past intensive‑care data and then calibrated so that its probability scores match observed outcome rates; among patients assigned a 20% risk, for instance, about one in five should actually develop sepsis. On top of this, it wraps each prediction in an uncertainty “shell”: a small set of candidate labels built, using a technique called conformal prediction, to contain the true answer at a pre‑chosen rate, provided new patients resemble those used to build it. The system then applies a single, easily explained rule: if its confidence in the top label falls below a chosen threshold, it defers the case to a clinician; otherwise, it predicts.
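For readers who want to see the mechanics, here is a minimal Python sketch of the two ingredients just described, under stated assumptions. It is not the authors' code. It builds a split-conformal prediction set from probabilities that are assumed to be already calibrated, using the common score “one minus the probability of the true label,” and it applies the top-label confidence rule for deferral. The function names (conformal_threshold, prediction_set, triage), the parameters alpha (allowed miss rate) and tau (deferral threshold), and the example numbers are all illustrative; label 1 stands for sepsis and label 0 for no sepsis.

```python
import math

# Hypothetical sketch: split conformal prediction for a binary sepsis label
# (0 = no sepsis, 1 = sepsis) plus a top-label confidence deferral rule.
# Inputs are assumed to be already-calibrated class probabilities.

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Return the split-conformal threshold for target coverage 1 - alpha.

    Nonconformity score: 1 - probability assigned to the true label.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too little calibration data: every label is admitted
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All labels whose nonconformity score is at most the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


def triage(probs, tau):
    """Defer if top-label confidence is below tau; otherwise predict it."""
    top = max(range(len(probs)), key=lambda j: probs[j])
    return "defer" if probs[top] < tau else top


# Illustrative calibration data (made up for this example).
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8],
             [0.7, 0.3], [0.95, 0.05], [0.4, 0.6], [0.85, 0.15]]
cal_labels = [0, 0, 1, 1, 1, 0, 0, 1, 0]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
print(qhat)                               # 0.4
print(prediction_set([0.9, 0.1], qhat))   # [0]
print(triage([0.55, 0.45], tau=0.7))      # defer
print(triage([0.1, 0.9], tau=0.7))        # 1
```

With this particular score, a set can contain both labels or, when the model is torn between them, none at all; either outcome is a reasonable signal for human review.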

Designing for changing hospital conditions

A major worry with clinical AI is that hospitals evolve: treatments, patient mixes, and recording practices shift over months and years, so a model that worked yesterday may be less reliable today. To probe this, the study uses an intensive‑care dataset in which patients are split not only into development and test sets but also into “in‑distribution” (earlier time period) and “out‑of‑distribution” (later time period) groups. The framework builds three flavors of uncertainty sets: a standard version, a version calibrated separately for each demographic group (here, gender), and a version that explicitly adjusts for time‑related changes in the data. All three target the same nominal coverage, meaning the share of patients whose true outcome falls inside the set, but the group‑aware and shift‑adjusted versions are designed to hold up better when the hospital environment drifts.
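The three flavors can be sketched around a single weighted quantile. This is an assumption-laden illustration, not the paper's implementation. The standard version gives every calibration patient equal weight. The group-aware version (often called group-conditional or Mondrian conformal prediction) computes a separate threshold within each group. The shift-adjusted version follows the general weighted-conformal idea of up-weighting calibration patients who look like the later cohort, with each weight an estimated density ratio. The helper odds_weight, which turns a hypothetical classifier's probability that a patient comes from the later period into such a ratio, is likewise illustrative. Its ratio is only correct up to a constant factor, which cancels as long as the test patient's weight is computed the same way.

```python
import math

# Hypothetical sketch of three conformal threshold variants. Scores are
# nonconformity scores (e.g. 1 - probability of the true label) computed
# on a held-out calibration set.

def weighted_conformal_quantile(scores, weights, alpha, test_weight=1.0):
    """Weighted (1 - alpha) quantile, with the test point's weight at +inf.

    Equal weights recover the standard split-conformal threshold. For the
    shift-adjusted variant, pass the test patient's own weight.
    """
    total = sum(weights) + test_weight
    target = (1 - alpha) * total
    cum = 0.0
    for s, w in sorted(zip(scores, weights)):
        cum += w
        if cum >= target - 1e-12:   # small tolerance for float rounding
            return s
    return math.inf  # calibration mass too small: admit every label


def standard_threshold(scores, alpha):
    return weighted_conformal_quantile(scores, [1.0] * len(scores), alpha)


def mondrian_thresholds(scores, groups, alpha):
    """Group-conditional: one standard threshold per group (e.g. gender)."""
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    return {g: standard_threshold(ss, alpha) for g, ss in by_group.items()}


def odds_weight(p_later):
    """Density-ratio estimate (up to a constant) from P(later period | x)."""
    return p_later / (1.0 - p_later)


# Illustrative scores, groups, and weights (made up for this example).
scores = [0.05, 0.1, 0.15, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6]
groups = ["F", "F", "F", "F", "M", "M", "M", "M", "M"]
weights = [1.0] * 8 + [5.0]   # last patient resembles the later cohort

print(standard_threshold(scores, 0.2))                    # 0.4
print(weighted_conformal_quantile(scores, weights, 0.2))  # 0.6
print(mondrian_thresholds(scores, groups, 0.2))           # {'F': 0.2, 'M': 0.6}
```

In this toy example, up-weighting the one high-score patient raises the threshold from 0.4 to 0.6, so sets become wider; that trade of a little compactness for coverage that survives drift is the intuition behind shift-adjusted sets.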

What happens when the model can defer

The results show that allowing the model to abstain on uncertain cases sharply improves the quality of the predictions it does keep. At a setting where it still provides answers for 80% of patients, the error rate among these “retained” cases drops by roughly half compared with forcing the model to predict for everyone, both in the earlier time period and under the later temporal shift. The single confidence threshold, tuned on a held‑out calibration set, yields low expected clinical cost on both test splits, and this cost rises only moderately when the data distribution changes. Importantly, the model remains well‑calibrated: when it assigns a case a given probability of sepsis, that figure closely matches the rate observed in reality, which is essential if clinicians are to trust its warnings and deferrals.
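The quantities reported here can be computed along the following lines: coverage (the share of patients the model handles itself), the error rate on retained cases, expected cost, and calibration. This is a sketch under explicit assumptions. The cost weights (a missed sepsis case costs 10, a false alarm 1, a deferral 0.5) are made-up placeholders rather than the paper's values. The threshold is chosen by a simple grid search, and calibration is summarized with a standard binned expected calibration error (ECE), which may differ from the measure the authors used. All function names and data are illustrative.

```python
import math

# Hypothetical sketch: evaluating a top-label confidence deferral rule.
# probs[i] = [P(no sepsis), P(sepsis)]; labels[i] in {0, 1} (1 = sepsis).

def predict(p):
    return 1 if p[1] >= p[0] else 0  # ties go to sepsis (the cautious choice)


def selective_metrics(probs, labels, tau):
    """Return (coverage, error rate on retained cases) at threshold tau."""
    kept = [(p, y) for p, y in zip(probs, labels) if max(p) >= tau]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, 0.0
    errors = sum(predict(p) != y for p, y in kept)
    return coverage, errors / len(kept)


def expected_cost(probs, labels, tau, c_fn=10.0, c_fp=1.0, c_defer=0.5):
    """Average per-patient cost; the default cost values are placeholders."""
    total = 0.0
    for p, y in zip(probs, labels):
        if max(p) < tau:
            total += c_defer
        elif predict(p) == 0 and y == 1:
            total += c_fn   # missed sepsis case
        elif predict(p) == 1 and y == 0:
            total += c_fp   # false alarm
    return total / len(labels)


def tune_threshold(probs, labels, candidates, **costs):
    """Pick the candidate threshold with the lowest expected cost."""
    return min(candidates, key=lambda t: expected_cost(probs, labels, t, **costs))


def threshold_for_coverage(probs, target):
    """Largest threshold that still retains at least `target` of patients."""
    confs = sorted((max(p) for p in probs), reverse=True)
    k = max(1, math.ceil(target * len(confs) - 1e-9))  # guard float rounding
    return confs[k - 1]


def expected_calibration_error(p_sepsis, labels, n_bins=10):
    """Binned ECE: weighted gap between mean predicted risk and event rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_sepsis, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            ece += len(b) / len(labels) * abs(mean_p - rate)
    return ece


# Illustrative held-out data (made up for this example).
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8],
         [0.7, 0.3], [0.95, 0.05], [0.4, 0.6], [0.85, 0.15], [0.55, 0.45]]
labels = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]

print(selective_metrics(probs, labels, 0.0))                # (1.0, 0.1)
print(selective_metrics(probs, labels, 0.7))                # (0.7, 0.0)
print(tune_threshold(probs, labels, [0.0, 0.6, 0.7, 0.9]))  # 0.7
print(threshold_for_coverage(probs, 0.8))                   # 0.6
```

In this toy data, deferring the three least confident patients removes the only error among retained cases. Note that ties in confidence mean an 80% coverage target can end up retaining more than 80% of patients.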

Keeping fairness and reliability in view

Because clinical tools must work for all patients, the authors also inspect performance across demographic subgroups. By constructing separate uncertainty sets for male and female patients, the system equalizes how often the true outcome falls inside its predicted set, shrinking the gender gap in this coverage measure to about one percentage point. Meanwhile, the version that reweights past data to mimic the later patient mix shows the smallest drop in reliability when moving from the earlier to the later cohort. Across methods, the uncertainty sets stay compact, typically containing a single label, so clinicians are not overwhelmed with ambiguous outputs. Larger sets are rare, and when they do appear they serve as natural flags that a case deserves closer human attention.
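The fairness and compactness checks described here come down to simple bookkeeping over the prediction sets. The following minimal sketch assumes each prediction set is a list of labels and each patient carries a group tag. The function names and the data are illustrative, not taken from the study.

```python
from collections import defaultdict

# Hypothetical sketch: per-group coverage and prediction-set size summaries.

def group_coverage(pred_sets, labels, groups):
    """Share of patients in each group whose true label is in their set."""
    hits, counts = defaultdict(int), defaultdict(int)
    for s, y, g in zip(pred_sets, labels, groups):
        counts[g] += 1
        hits[g] += int(y in s)
    return {g: hits[g] / counts[g] for g in counts}


def coverage_gap(coverage_by_group):
    """Largest minus smallest group coverage (0 = perfectly equal)."""
    values = list(coverage_by_group.values())
    return max(values) - min(values)


def set_size_summary(pred_sets):
    """Mean set size and the shares of single-label and multi-label sets."""
    n = len(pred_sets)
    return {
        "mean_size": sum(len(s) for s in pred_sets) / n,
        "singleton_rate": sum(len(s) == 1 for s in pred_sets) / n,
        "multi_label_rate": sum(len(s) > 1 for s in pred_sets) / n,
    }


# Illustrative data (made up for this example).
pred_sets = [[0], [1], [0, 1], [0], [1], [0]]
labels = [0, 1, 1, 1, 1, 0]
groups = ["F", "F", "F", "M", "M", "M"]

cov = group_coverage(pred_sets, labels, groups)
print(cov)   # {'F': 1.0, 'M': 0.6666666666666666}
print(coverage_gap(cov))
print(set_size_summary(pred_sets))
```

A group-aware conformal method aims to push this gap toward zero by calibrating within each group, while the multi-label rate could serve as a simple running indicator of how often cases are being flagged as ambiguous.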

What this means for real‑world triage

For non‑specialists, the takeaway is that the authors are not just chasing higher accuracy scores; they are engineering an AI assistant that is cautious by design. By combining honest uncertainty estimates, a clear rule for when to defer, and a cost model that heavily penalizes missed sepsis, the framework cuts errors on automatically handled patients while keeping overall harm low, even as hospital conditions shift. The approach also makes fairness and monitoring part of the design rather than an afterthought. In practice, such a system would not replace clinicians, but instead act as a safety‑focused filter—handling straightforward cases confidently, flagging borderline ones for human review, and providing transparent knobs that hospitals can tune to match their own risk tolerance and resource limits.

Citation: Kwon, H., Kim, DJ. Conformal selective prediction with cost aware deferral for safe clinical triage under distribution shift. Sci Rep 16, 10016 (2026). https://doi.org/10.1038/s41598-026-40637-w

Keywords: clinical triage, sepsis prediction, uncertainty in AI, selective prediction, healthcare safety