Clear Sky Science

A comparative analysis of data-driven models for breast cancer survival prediction

Why this study matters for women’s health

Breast cancer is now the most commonly diagnosed cancer in women worldwide, and in countries with fewer medical resources it is often found late and treated under difficult conditions. This study focuses on women in Ethiopia and asks a life‑or‑death question: given the information doctors already collect about a patient, can modern data tools more accurately predict who is at highest risk of dying from breast cancer? Better predictions could help doctors prioritize limited treatments, schedule closer follow‑up for the most vulnerable patients, and give women clearer information about their outlook.

Women, hospitals, and everyday medical records

The researchers analyzed medical records from 1,164 women treated for breast cancer between 2019 and 2024 at two major Ethiopian hospitals. For each woman, they tracked how long she lived after diagnosis and whether she died during the follow‑up period or was still alive when the data collection ended. Alongside this outcome, they used common clinical and social information that hospitals routinely record: age, tumor size, cancer stage, whether the cancer had spread to distant organs (metastasis) or lymph nodes, other illnesses, marital status, lifestyle habits such as smoking or khat use, and whether the woman had breastfed. These are all details that can be gathered without expensive tests, making any resulting prediction tools realistic for low‑resource settings.
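The outcome described above has two parts: how long each woman was followed, and whether she died or was still alive when follow-up ended (statisticians call the latter "right-censored"). The sketch below shows one way such records could be turned into (time, event) pairs. It is an illustration only, not the authors' code: the `PatientRecord` fields, the `to_survival_pair` helper, the `STUDY_END` date, and the two example patients are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Hypothetical cut-off date; the summary only says records span 2019 to 2024.
STUDY_END = date(2024, 12, 31)


@dataclass
class PatientRecord:
    diagnosis: date
    death: Optional[date] = None  # None means no death was recorded


def to_survival_pair(rec: PatientRecord, study_end: date = STUDY_END) -> Tuple[int, int]:
    """Return (days of follow-up, event).

    event is 1 for an observed death and 0 for a woman who was still
    alive when follow-up ended (a right-censored observation).
    """
    if rec.death is not None and rec.death <= study_end:
        return (rec.death - rec.diagnosis).days, 1
    return (study_end - rec.diagnosis).days, 0


# Made-up records for illustration only.
died = PatientRecord(diagnosis=date(2020, 1, 1), death=date(2021, 1, 1))
alive = PatientRecord(diagnosis=date(2023, 1, 1))
print(to_survival_pair(died))
print(to_survival_pair(alive))
```

Keeping the event flag next to the follow-up time matters because a woman who was alive at the end of the study is not a "survivor forever"; the data only show that she lived at least that long.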

Figure 1.

Old and new ways to read survival chances

Traditionally, doctors and statisticians use survival analysis methods such as Kaplan–Meier curves and the Cox proportional hazards model to understand how long patients live with a disease and which factors affect that time. These methods are relatively easy to interpret but struggle when many factors interact in complex, non‑linear ways, as is often the case in real‑world cancer care. The authors compared these classical approaches with more flexible machine‑learning models, including random survival forests and deep‑learning survival models, as well as standard classification tools like support vector machines, random forests, XGBoost, and LightGBM. All models were trained on part of the data and tested on unseen cases, and their performance was judged with measures that capture both how well they rank patients by risk and how well their predicted survival times match reality.
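To make two of these ideas concrete, here is a minimal sketch in plain Python of a Kaplan–Meier survival curve and of Harrell's concordance index, a common measure of how well a model ranks patients by risk. The summary does not say which exact metrics or software the authors used, so the choice of the concordance index, the function names, and the toy numbers are assumptions for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct death time."""
    curve = []
    surv = 1.0
    death_times = sorted({d for d, e in zip(durations, events) if e == 1})
    for t in death_times:
        # Everyone still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the patient who
    died earlier was given the higher risk score (ties count as half)."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is anchored on an observed death
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (months of follow-up, 1 = died, 0 = censored); not from the study.
durations = [5, 8, 12, 12, 20]
events = [1, 0, 1, 1, 0]
risk = [0.9, 0.4, 0.7, 0.6, 0.1]  # hypothetical model output

print(kaplan_meier(durations, events))
print(concordance_index(durations, events, risk))
```

A concordance index of 1.0 means the model ranked every comparable pair correctly, while 0.5 is what random guessing would achieve. Censored patients still count in the "at risk" group of the Kaplan–Meier curve up to the moment their follow-up ends.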

Which factors shape survival the most?

Across the entire group, several patterns stood out even before applying advanced models. Women with larger tumors, more lymph nodes involved, or cancer that had already spread had much poorer survival. Those diagnosed at stage IV were especially likely to die during follow‑up, while women with stage I disease had far better outcomes. Older age, particularly 45 years and above, and the presence of other illnesses, such as chronic diseases, also worsened survival. Habits such as smoking, alcohol use, or khat use were linked to poorer outcomes as well. Married women tended to live longer than single, divorced, or widowed women, echoing findings from other countries that social support can improve cancer survival by helping patients stay engaged with care.

Figure 2.

What smart algorithms added to the picture

When the team compared models, random survival forests, a method that grows many survival‑focused decision trees and combines their results, delivered the most accurate predictions of how long patients would live. A closely related method, random forests used as a classifier, was the best at distinguishing higher‑risk from lower‑risk women. To avoid the “black box” problem, the researchers used a technique called SHAP (SHapley Additive exPlanations), which estimates how much each factor pushes an individual prediction up or down, to see which factors the models relied on most. Across the strongest models, the same features kept rising to the top: age, tumor size, metastasis, lymph node involvement, overall stage, and the presence of other illnesses. Social features like marital status and certain habits also contributed, but to a lesser degree. In effect, the models learned and quantified the same key risk signals that clinicians worry about, while also weighing how they combine in subtle ways.
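SHAP builds on Shapley values, which split the gap between one patient's prediction and a reference prediction fairly among the input features. The sketch below computes exact Shapley values for a tiny, hand-written risk formula. Everything in it is an assumption for illustration: `toy_risk` and its coefficients are invented rather than fitted to the study data, a single reference patient stands in for the background dataset that SHAP tools normally use, and brute-force enumeration like this is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features in `subset` take the patient's value, the rest the baseline's.
        mixed = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(p):
    """Invented risk score with one interaction term; not the study's model."""
    return (0.02 * p["age"] + 0.3 * p["tumor_cm"] + 1.5 * p["metastasis"]
            + 0.2 * p["tumor_cm"] * p["metastasis"])


patient = {"age": 60, "tumor_cm": 4, "metastasis": 1}    # hypothetical patient
reference = {"age": 40, "tumor_cm": 2, "metastasis": 0}  # hypothetical baseline

phi = shapley_values(toy_risk, patient, reference)
print(phi)
print(sum(phi.values()), toy_risk(patient) - toy_risk(reference))
```

The contributions add up to the difference between the patient's score and the reference score, which is the property that makes these attributions easy to read. The interaction between tumor size and metastasis is shared between those two features rather than assigned to only one of them.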

What this means for patients and clinics

The study concludes that, for Ethiopian women with breast cancer, data‑driven survival models tailored to time‑to‑death prediction—especially random survival forests—can provide more accurate and still interpretable risk estimates than traditional methods alone. Because these models use information already collected in routine care, they could be built into simple tools that flag high‑risk patients, help doctors decide who needs faster referral or more intensive treatment, and guide honest yet personalized conversations about prognosis. While the work has limits—it lacked genetic and imaging data and relied on retrospective records—it shows that carefully applied machine learning can turn ordinary hospital data into practical support for cancer care in resource‑constrained settings.

Citation: Takele, K., Chen, DG. A comparative analysis of data-driven models for breast cancer survival prediction. Sci Rep 16, 10114 (2026). https://doi.org/10.1038/s41598-026-40565-9

Keywords: breast cancer survival, machine learning, random survival forest, Ethiopia, clinical risk factors