Clear Sky Science

Machine learning-based proteogenomic data modeling identifies circulating plasma biomarkers for early detection of lung cancer

Why this research matters

Lung cancer kills more people worldwide than any other cancer, largely because it is usually found too late. Today’s screening tools mainly focus on heavy smokers and rely on imaging scans that can miss early disease. This study asks a simple but powerful question: can a routine blood sample, collected years before symptoms appear, reveal who is quietly moving toward lung cancer? By combining genetic data with thousands of blood proteins and modern machine learning, the researchers search for early warning signals that could someday broaden screening and save lives.

Looking for clues in genes and blood

The team first examined DNA from hundreds of thousands of people in large population biobanks in the United Kingdom and Finland. They compared the genetic codes of people who developed lung cancer with those who did not, pinpointing stretches of DNA linked to higher risk. Next, they asked whether those same genetic changes were tied to differences in specific proteins circulating in the blood. Proteins are the body’s workhorse molecules, and shifts in their levels can reveal early biological stress long before a tumor shows up on a scan. By connecting risk genes to blood protein levels, the researchers began to map how inherited susceptibility might subtly reshape the body’s internal chemistry on the path to lung cancer.

Figure 1.

Following blood signals years before diagnosis

The second, complementary part of the study focused directly on blood proteins as possible early signals of disease. Using a high-throughput platform, the scientists measured nearly 3,000 different proteins in blood samples from more than 26,000 volunteers in the UK Biobank. Some people were already diagnosed with lung cancer when their blood was drawn, but many others developed the disease only years later. The researchers grouped these “future patients” based on when they were diagnosed: within 0–4 years, 5–9 years, or anywhere within 0–9 years after giving blood. They then compared protein levels between each group and cancer-free participants to find proteins that consistently differed long before diagnosis.
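The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the group labels, the exact boundaries of each window (here, under 5 years and under 10 years), and the toy protein values are all hypothetical and are not taken from the study.

```python
from typing import List, Optional


def diagnosis_windows(years_to_diagnosis: Optional[float]) -> List[str]:
    """Return the comparison groups a participant belongs to.

    years_to_diagnosis is the time from blood draw to lung cancer
    diagnosis, or None for participants who stayed cancer-free.
    The cut-offs below are an assumed reading of "0-4" and "5-9" years.
    """
    if years_to_diagnosis is None:
        return ["cancer-free"]
    if years_to_diagnosis < 0:
        # Already diagnosed when blood was drawn (not a "future patient").
        return ["diagnosed before blood draw"]
    if years_to_diagnosis < 5:
        return ["0-4 years", "0-9 years"]
    if years_to_diagnosis < 10:
        return ["5-9 years", "0-9 years"]
    return ["later than 9 years"]


def mean_difference(levels: List[float], groups: List[List[str]], window: str) -> float:
    """Mean protein level in one window minus the mean in cancer-free people."""
    cases = [x for x, g in zip(levels, groups) if window in g]
    controls = [x for x, g in zip(levels, groups) if "cancer-free" in g]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


# Toy data (invented for illustration, not study measurements).
years = [2.0, 7.5, None, None, 3.0]
levels = [1.4, 1.1, 0.8, 1.0, 1.6]
groups = [diagnosis_windows(y) for y in years]

for window in ("0-4 years", "5-9 years", "0-9 years"):
    print(window, round(mean_difference(levels, groups, window), 3))
```

Note that the 0–9 year group overlaps the other two by design, so a participant diagnosed three years after the blood draw counts in both the 0–4 and the 0–9 comparison. A real analysis would also use a statistical test and adjust for factors such as age and smoking rather than comparing raw means.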

Teaching computers to spot high-risk profiles

Because no single protein told the whole story, the team turned to machine learning to interpret complex patterns across hundreds of markers at once. They trained several types of algorithms, including random forests and neural networks, to distinguish people who would go on to develop lung cancer from those who remained cancer-free, using only their blood protein profiles. The models performed well, reaching areas under the ROC curve (AUCs, where 0.5 is no better than chance and 1.0 is perfect separation) of around 0.8–0.88, even when using samples taken up to nine years before diagnosis. Notably, models built from protein data clearly outperformed those based only on standard risk factors such as age, sex, and smoking history, showing that the blood signals add meaningful information beyond what doctors already know.
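To make the AUC comparison concrete, here is a small sketch of how such a score can be computed and how two models can be compared on the same people. It assumes each model outputs a risk score per participant. The function name `auc` and all numbers below are invented for illustration; they are not the study's models or data.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen future case
    (label 1) gets a higher score than a randomly chosen
    cancer-free participant (label 0); ties count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy risk scores (hypothetical): 1 = later developed lung cancer.
labels = [1, 1, 1, 0, 0, 0]
protein_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]    # protein-based model
baseline_scores = [0.6, 0.5, 0.3, 0.6, 0.4, 0.2]   # age, sex, smoking only

print("protein model AUC: ", round(auc(protein_scores, labels), 3))
print("baseline model AUC:", round(auc(baseline_scores, labels), 3))
```

The pairwise definition is used here because it is easy to read and needs no external libraries. It scales poorly to tens of thousands of participants, where a rank-based routine from a statistics package would be the usual choice, and in practice the scores should come from people the model did not see during training.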

Figure 2.

What the key proteins reveal

Across the different time windows, the researchers repeatedly identified a core set of 22 proteins whose levels were strongly linked to future lung cancer. Fourteen of these had been connected to lung cancer before, while eight emerged as new candidates. Many of the proteins are involved in immune responses, inflammation, and scarring processes in lung tissue, suggesting that early lung cancer may reshape the body’s defense systems long before it can be seen on imaging. In people whose blood was drawn 5–9 years before diagnosis, higher levels of several proteins were also tied to worse survival once cancer appeared, hinting that the same early markers may carry information about how aggressive a future tumor could be.

What this means for patients

This work does not yet deliver a ready-to-use blood test, and it does not prove these proteins cause lung cancer. Instead, it offers a detailed map of how genes and blood chemistry shift in the years leading up to diagnosis, and it highlights specific circulating proteins that deserve deeper study as early warning markers. If future research confirms and refines these findings, a simple blood draw could one day help identify high-risk individuals—including some lifelong nonsmokers—years before symptoms arise, guiding more timely scans, closer monitoring, and ultimately more lives saved.

Citation: Johnson, M.A., Nieves-Rodriguez, S., Hou, L. et al. Machine learning-based proteogenomic data modeling identifies circulating plasma biomarkers for early detection of lung cancer. Commun Med 6, 253 (2026). https://doi.org/10.1038/s43856-026-01500-1

Keywords: lung cancer, blood biomarkers, proteomics, genetic risk, early detection