Clear Sky Science

A weakly supervised transformer for rare disease diagnosis and subphenotyping from EHRs with pulmonary case studies


Why finding rare diseases faster matters

For families living with rare illnesses, getting a name for what is wrong can take years. Symptoms are often vague, doctors may see only a handful of such cases in a lifetime, and existing tests do not always give clear answers. This study explores a new way to use the digital footprints left in electronic health records to spot two hard-to-diagnose lung conditions earlier and to sort patients into groups that may face very different futures.

Figure 1.

The long road to a rare diagnosis

Each rare disease is uncommon on its own, but together they affect hundreds of millions of people worldwide. Many begin in childhood and can be life‑threatening if missed. The paper focuses on rare lung diseases, where everyday complaints like shortness of breath or wheeze can easily be mistaken for asthma or other common problems. As a result, children with conditions such as pulmonary hypertension or severe forms of asthma may see many specialists and wait years before getting the right diagnosis, losing precious time when early treatment could change the course of their disease.

Turning messy medical records into clues

Modern hospitals store huge amounts of information in electronic health records, from diagnosis codes and prescriptions to lab tests and doctors’ notes. Hidden in these data are patterns that can hint at a rare disease long before it is formally named. But there is a catch: only a small fraction of patients have been carefully reviewed by experts, so high‑quality labels that say who truly has a disease are scarce. Most records carry only rough, “noisy” signals—codes that may reflect billing quirks, tentative guesses, or outdated labels. Traditional computer models struggle in this setting because they are built to learn from large collections of clean, trustworthy examples.

A new way of learning from imperfect data

The authors introduce WEST, a "weakly supervised transformer" that is designed to learn from this mix of a few accurate labels and many uncertain ones. The system starts with two groups of patients at Boston Children’s Hospital who might have pulmonary hypertension or severe asthma, identified by broad screening codes. Within each group, a small subset has been confirmed by specialists, while the rest receive probabilistic scores from earlier, rule‑based tools. WEST uses a transformer—an advanced pattern‑finding architecture originally developed for language—to turn each child’s entire medical history into a compact numerical portrait. Crucially, it does not treat the rough labels as fixed truth: after each training round, the model updates its own estimates of who is likely to be sick and feeds those refined probabilities back into the next round, steadily cleaning up the signal.
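
The feedback loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: a tiny logistic regression stands in for the transformer, the synthetic cohort from `make_toy_cohort` is invented for the demo, and the update rule (a simple blend of the old weak label and the new prediction, controlled by the hypothetical `blend` and `gold_weight` parameters) is one plausible choice rather than the rule used in the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, targets, weights, epochs=200, lr=0.5):
    """Fit logistic regression to soft targets in [0, 1] by gradient descent."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    total = sum(weights)
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t, s in zip(X, targets, weights):
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = s * (pred - t)  # weighted cross-entropy gradient
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / total for wi, g in zip(w, gw)]
        b -= lr * gb / total
    return w, b


def predict(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def refine_labels(X, soft_labels, gold, rounds=5, blend=0.5, gold_weight=5.0):
    """Iteratively refine weak labels.

    gold maps patient index -> expert-confirmed 0/1 label (assumed format);
    every other patient carries only a noisy probability.
    """
    labels = list(soft_labels)
    for i, y in gold.items():
        labels[i] = float(y)
    weights = [gold_weight if i in gold else 1.0 for i in range(len(X))]
    model = None
    for _ in range(rounds):
        model = fit_logistic(X, labels, weights)
        for i, x in enumerate(X):
            if i not in gold:  # expert labels are never overwritten
                labels[i] = (1 - blend) * labels[i] + blend * predict(model, x)
    return model, labels


def make_toy_cohort(n=200, seed=0):
    """Synthetic stand-in for patient features and rule-based weak scores."""
    rng = random.Random(seed)
    X, noisy, truth = [], [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        X.append([rng.gauss(1.5 * y, 1.0), rng.gauss(-1.0 * y, 1.0)])
        # Weak label: a probability only loosely tied to the true status.
        p = 0.5 + (0.25 if y else -0.25) + rng.gauss(0, 0.2)
        noisy.append(min(1.0, max(0.0, p)))
        truth.append(y)
    return X, noisy, truth


if __name__ == "__main__":
    X, noisy, truth = make_toy_cohort()
    gold = {i: truth[i] for i in range(20)}  # pretend 20 charts were reviewed
    model, refined = refine_labels(X, noisy, gold)
    print("first refined probabilities:", [round(p, 2) for p in refined[20:25]])
```

The point of the sketch is the loop in `refine_labels`: the confirmed cases stay fixed and count for more, while every other patient's probability is nudged toward what the current model believes before the next round of training.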

Figure 2.

What the model discovered in lung disease

When tested on held‑out, expert‑validated patients, WEST was more accurate than several alternatives, including simple code‑counting rules, gradient‑boosted trees, and transformers that either ignored the noisy labels or accepted them at face value. It needed surprisingly few gold‑standard cases to perform well: about 100 carefully reviewed patients were enough to match or beat other approaches. Beyond saying who probably had each condition, the model’s internal representations naturally grouped children into clinically meaningful clusters. For pulmonary hypertension, WEST separated patients into a slow‑progressing group and a fast‑progressing group, which showed clearly different survival patterns over five years. For severe asthma, it split patients into those with frequent, dangerous flare‑ups and those with relatively few attacks, mirroring differences in hospitalizations, low‑oxygen episodes, and respiratory failure.
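
To give a feel for how patient representations can be split into subgroups, here is a minimal sketch that clusters embedding vectors into two groups. The summary does not say which clustering method the authors used, so plain k-means is an assumption made purely for illustration, and the two-dimensional "embeddings" below are invented toy values.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vector(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k=2, iterations=50, seed=0):
    """Plain k-means: returns (cluster index per point, cluster centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = mean_vector(members)
    return assignment, centers


if __name__ == "__main__":
    # Toy embeddings: two loose groups of patients (hypothetical values).
    embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
    groups, centers = kmeans(embeddings, k=2)
    print(groups)
```

In practice the clusters would then be compared on outcomes the model never saw during clustering, such as survival or hospitalization rates, to check that the split is clinically meaningful.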

How this could change care for patients

To a non‑specialist, the key message is that WEST learns to “see” complex disease patterns in routine hospital data without relying on huge, perfectly labeled datasets. By cleverly recycling imperfect signals and a small amount of expert input, it can flag likely rare‑disease cases more accurately and reveal hidden subgroups that face different risks. In the long run, systems like WEST could help shorten the diagnostic odyssey for children with rare lung diseases, guide doctors toward earlier specialist referral, and support more tailored monitoring and treatment plans based on how a patient’s disease is likely to unfold.

Citation: Greco, K.F., Yang, Z., Li, M. et al. A weakly supervised transformer for rare disease diagnosis and subphenotyping from EHRs with pulmonary case studies. npj Digit. Med. 9, 211 (2026). https://doi.org/10.1038/s41746-026-02406-x

Keywords: rare disease diagnosis, electronic health records, machine learning in medicine, pulmonary hypertension, severe asthma