Clear Sky Science

Machine learning approaches to predict the risk of tuberculosis among household contacts of index TB patients in Central Ethiopia

Why this matters for families

Tuberculosis (TB) is often thought of as a disease caught from strangers on buses or in crowded markets, but many infections actually happen at home. When one person in a household has TB, their relatives share air, rooms and beds—yet only a few will go on to develop the disease. This study from Central Ethiopia asks a practical question with global relevance: can computers help us quickly pick out which family members are most likely to fall sick, so that scarce tests and medicines are used where they are needed most?

Life inside the homes studied

The researchers worked with health teams who routinely visit the homes of people diagnosed with infectious lung TB. In four rural districts and three small towns, they collected detailed information on 387 "index" TB patients and 1,277 people living with them. Many households were crowded: a typical family of four lived in a small home that often had only one room and one window. Most families cooked over wood or charcoal fires, filling the air with smoke. Many household members were children or young adults, and almost half of both patients and contacts had little or no formal education. These are the kinds of settings where TB spreads easily, but even here only 23 of the 1,277 household members (about 2 in 100) were ultimately diagnosed with TB.

Figure 1

Turning home visits into data

Each home visit generated a rich picture of everyday life and health. For each contact, the team recorded age, sex, vaccination status, cough, fever, night sweats, tiredness, weight loss, time spent with the patient, and other illnesses such as asthma or diabetes. They also logged household details like number of rooms, type of house, cooking fuel and ventilation, along with characteristics of the original TB patient such as how long they had been sick before starting treatment. All of this information was turned into numbers suitable for computer analysis, with careful methods to handle missing answers and to prevent rare events—like the small number of TB cases in the dataset—from being ignored by the models.
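
This summary does not say which methods the authors used for missing answers or for the rarity of TB cases. As a minimal sketch of two common options, assuming median imputation and inverse-frequency class weights (both are assumptions, not the study's confirmed approach), the idea looks like this in Python. The `ages` list is hypothetical; the case counts come from the study.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rare classes count for more during training."""
    n_samples = len(labels)
    classes = sorted(set(labels))
    return {c: n_samples / (len(classes) * labels.count(c)) for c in classes}

# Counts reported in the study: 1,277 contacts, 23 of them diagnosed with TB.
labels = [1] * 23 + [0] * (1277 - 23)
weights = balanced_class_weights(labels)

# Hypothetical ages with two missing answers (illustration only).
ages = [34, None, 12, 7, None, 51]
ages_filled = impute_median(ages)

print(weights)
print(ages_filled)
```

With these counts, each TB case receives a weight of roughly 27.8 and each non-case roughly 0.51, so a model trained with such weights pays a much higher penalty for missing a TB case than for misclassifying a healthy contact.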

Letting algorithms search for patterns

The team then trained several types of machine learning models, computer programs that learn patterns from data, to predict which contacts had TB. These included familiar statistical tools, such as logistic regression, and more flexible approaches such as Random Forests, Balanced Random Forests, K‑Nearest Neighbors, artificial neural networks and gradient boosting. Because the overwhelming majority of contacts did not have TB, the authors focused on "recall": the share of people who truly have TB that a model correctly flags, even if catching more of them means raising some false alarms. In public health, missing a sick person is usually more dangerous than testing an extra healthy one.
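
To see why recall can be more informative than plain accuracy when cases are rare, here is a small Python sketch. The labels are hypothetical toy data, not results from the study, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = TB)."""
    tp = fn = fp = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif guess == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def recall(y_true, y_pred):
    """Share of true TB cases that the model flagged."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(y_true, y_pred):
    """Share of all contacts that the model classified correctly."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

# Hypothetical toy labels: 2 TB cases among 20 contacts.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20                     # never flags anyone
cautious_model = [1, 1, 1, 1, 1] + [0] * 15    # both cases plus 3 false alarms

print(accuracy(y_true, always_negative), recall(y_true, always_negative))
print(accuracy(y_true, cautious_model), recall(y_true, cautious_model))
```

In this toy example, the model that never flags anyone scores 90% accuracy while catching none of the TB cases, whereas the cautious model has slightly lower accuracy (85%) but a recall of 100%.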

Figure 2

What drove risk and which models worked best

Ensemble models that combine many simple decision rules, particularly Random Forest and its "balanced" variant, did the best job of finding true TB cases. They correctly identified about six out of seven people who had TB, while maintaining reasonable overall accuracy. The study also used a technique called SHAP to peek inside these "black box" models and see which factors mattered most. Being flagged as a presumptive TB case during screening, giving a sputum sample, having a long‑lasting or phlegmy cough, feeling very tired and losing appetite all strongly pushed a contact toward the "likely TB" side. Among household features, smaller house area (a sign of crowding) increased risk. Some characteristics seemed protective: being female, being taller, and living with an index patient who had more education were linked to lower risk, possibly reflecting differences in exposure, nutrition and access to care.
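
SHAP builds on Shapley values, which split a model's prediction for one person into a contribution from each feature. The sketch below computes exact Shapley values for a toy scoring function by averaging each feature's marginal effect over every possible ordering. Everything in it is an assumption for illustration: `toy_score`, its weights and the three feature names are hypothetical and are not the study's model, and the SHAP library itself relies on faster approximations and a background dataset rather than a single baseline.

```python
from itertools import permutations

def shapley_values(score, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features switch from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = score(current)
        for name in order:
            current[name] = instance[name]
            updated = score(current)
            totals[name] += updated - previous
            previous = updated
    return {name: totals[name] / len(orderings) for name in names}

def toy_score(x):
    """Hypothetical risk score (NOT the study's model)."""
    s = 0.02
    s += 0.30 * x["prolonged_cough"]
    s += 0.10 * x["fatigue"]
    s -= 0.01 * x["female"]
    # Interaction: cough and fatigue together add extra risk.
    s += 0.20 * x["prolonged_cough"] * x["fatigue"]
    return s

baseline = {"prolonged_cough": 0, "fatigue": 0, "female": 0}
contact = {"prolonged_cough": 1, "fatigue": 1, "female": 1}

phi = shapley_values(toy_score, contact, baseline)
print(phi)
```

The contributions always add up to the difference between the contact's score and the baseline score. Here the interaction term is shared equally between cough and fatigue, which is how this kind of method assigns credit when two factors matter jointly.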

What this means for TB control

For health programs that must stretch limited resources, the findings offer a way to use routine home‑visit data more intelligently. Instead of treating all household contacts the same, clinics could run simple computer models in the background to flag those at highest risk for closer follow‑up, faster testing or preventive treatment. The study suggests that even in low‑resource settings, carefully designed machine learning tools can support earlier TB detection among family members, reduce missed cases and make contact investigations more efficient—provided that the models are tested and adapted in other regions before being woven into national TB strategies.

Citation: Wolde, H.M., Kebede, W., Yewhalaw, D. et al. Machine learning approaches to predict the risk of tuberculosis among household contacts of index TB patients in Central Ethiopia. Sci Rep 16, 10457 (2026). https://doi.org/10.1038/s41598-026-41547-7

Keywords: tuberculosis, household contacts, machine learning, risk prediction, Ethiopia