Clear Sky Science · en
Illuminating cell states by a comprehensive and interpretable single cell foundation model
Why smarter cell maps matter
Every tissue in your body is a bustling city of cells, each with its own job and life story. Modern tools can read the activity of thousands of genes in millions of individual cells, but this flood of data is messy, patchy, and hard to interpret. This paper introduces CellVQ, a new artificial intelligence model designed to turn those tangled measurements into clear, human-understandable maps of cell types, cell states, and their responses to drugs and genetic changes.

A new way to read single cells
The authors start from a simple idea: to understand health and disease, we need a reliable “language” for describing what state each cell is in. Existing AI models trained on single-cell data are powerful but struggle with three real-world problems. First, most measurements are extremely sparse, with many genes apparently silent. Second, different labs and techniques produce data on different scales, making it hard to compare results. Third, the internal workings of these models are often opaque, which limits their usefulness to biologists who want clear explanations, not just predictions.
Turning cell activity into a reusable cell code
CellVQ tackles these issues with a large model trained on 68 million cells that learns a compact “cell code” for each cell. Rather than representing each cell as a long list of raw numbers, CellVQ passes gene activity patterns through an encoder and a special Single-Cell Discretization module. This module groups similar patterns into shared codes, so cells from different experiments that behave alike end up with related codes. At the same time, a decoder learns to reconstruct missing gene activity using a statistical model tailored for data with many zeros. This training strategy helps the system cope with sparse measurements while capturing meaningful relationships between genes.
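For readers who like to see the idea in code, the "shared code" step can be pictured as looking up the nearest entry in a small codebook. The sketch below is only an illustration under stated assumptions: the two-dimensional embeddings, the three-entry codebook, and the function name `nearest_code` are all hypothetical, and the article does not describe how CellVQ's Single-Cell Discretization module is actually implemented.

```python
import math

def nearest_code(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding` (Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = math.dist(embedding, code)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

# Hypothetical codebook: three learned "cell codes" in a toy 2-D embedding space.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Hypothetical encoder outputs for three cells. The first two come from
# different experiments but behave alike, so their embeddings are close.
cell_from_lab_a = (0.9, 1.2)
cell_from_lab_b = (1.1, 0.8)
cell_other = (4.7, 5.2)

codes = [nearest_code(cell, codebook)
         for cell in (cell_from_lab_a, cell_from_lab_b, cell_other)]
print(codes)  # [1, 1, 2]
```

The two similar cells receive the same code even though their raw numbers differ slightly, which is the intuition behind comparing cells across experiments through a shared vocabulary rather than through raw measurements.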
From raw data to useful predictions
Once trained, CellVQ can be applied to many tasks without extra fine-tuning. The model separates cell types more cleanly than competing methods, leading to sharper clusters and more accurate automatic labeling of cell identities. It also predicts practical properties such as tissue of origin, age, sex, and disease status better than earlier approaches. Remarkably, the same representations work well on bulk samples that average many cells together, boosting performance in predicting how cancer cells respond to different drugs and how sensitive patients or cell lines might be to specific treatments.

Revealing how genes and drugs reshape cells
The study further tests whether CellVQ captures cause-and-effect relationships when genes are perturbed or cells are exposed to drugs. Using datasets where individual genes are switched off or several genes are altered in combination, CellVQ helps forecast how the rest of the genome responds at single-cell resolution, often matching or surpassing specialized models. For drug exposures, the authors combine CellVQ’s gene representations with a separate model that reads drug structures, and together these systems accurately predict how gene activity changes in immune cells treated with specific compounds. The method can pinpoint which genes shift the most, offering clues to drug action and side effects.
Building knowledge graphs of cell states
To make the model’s inner logic accessible, the authors introduce CellVQ-Graph, a lightweight add-on that uses CellVQ’s outputs to build a graph linking cells, genes, and descriptive properties such as tissue, disease label, age, and sex. In this graph, attention weights highlight which genes and features matter most for each cell state. Applied to brain and pancreas data, the system separates subtle subtypes of cells, proposes intermediate states, and calls out well-known marker genes alongside less studied candidates. It also infers networks of genes that tend to move together, shedding light on regulatory circuits that control development, stress responses, and inflammation.
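As a rough picture of how attention weights can rank genes, the sketch below turns raw relevance scores into weights that sum to one and sorts genes by weight. It uses a generic softmax, which is a common way to compute attention weights; the article does not say which formula CellVQ-Graph uses, and the gene names, scores, and function name `attention_weights` are hypothetical placeholders.

```python
import math

def attention_weights(scores):
    """Apply a softmax to raw scores, giving non-negative weights that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {gene: math.exp(score - peak) for gene, score in scores.items()}
    total = sum(exps.values())
    return {gene: value / total for gene, value in exps.items()}

# Hypothetical raw relevance scores linking three genes to one cell state.
raw_scores = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": -1.0}

weights = attention_weights(raw_scores)

# Genes with the largest weights are the ones highlighted for this cell state.
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)  # ['GENE_A', 'GENE_B', 'GENE_C']
```

Because the weights are normalized, they can be read as each gene's relative share of the model's attention for that cell state, which is what makes such a ranking easy to inspect.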
What this means for future cell research
In everyday terms, CellVQ and CellVQ-Graph act like a powerful translation and mapping engine for cellular life, converting noisy measurements into a shared code that can be compared across studies and diseases. The work shows that one model can both improve prediction tasks and offer clear biological clues, from key marker genes to likely gene-gene partnerships. While the current version is trained mainly on one type of molecular readout, the authors plan to extend it to more data types, aiming for a unified, interpretable atlas of how cells change over time, in different tissues, and under treatment.
Citation: Wang, J., Tan, C., Gao, Z. et al. Illuminating cell states by a comprehensive and interpretable single cell foundation model. Nat Commun 17, 4037 (2026). https://doi.org/10.1038/s41467-026-70071-5
Keywords: single-cell RNA sequencing, cell states, foundation model, gene regulation, drug response