Clear Sky Science · en
From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2
Why this matters for future pandemics
For most of the COVID‑19 pandemic, scientists played catch‑up: new variants appeared in the real world before labs could measure what those changes meant for infectiousness or immune escape. This study shows that powerful computer models, originally designed to understand human language, can instead read the "language" of proteins and infer how the coronavirus spike protein is likely to change and adapt—using only its sequence of building blocks. That capability could help researchers flag worrying variants earlier and may generalize to many other pathogens.
Teaching computers to read proteins
The authors work with a protein language model called ESM‑2, trained on tens of millions of protein sequences from across the tree of life. Much like a language model learns grammar and meaning from words, ESM‑2 learns which amino acid patterns "make sense" in real proteins. When given the sequence of the SARS‑CoV‑2 spike protein, the model assigns each possible mutation two key scores: a grammaticality score that reflects how well a changed sequence fits the learned rules of protein structure, and a semantic score that gauges how different the overall protein becomes in the model’s internal representation. These scores can be computed for every possible single mutation on a computer, a strategy known as in silico deep mutational scanning. 
Mapping where the virus can and cannot change
By scanning all single‑letter changes across spike, the team found that ESM‑2 naturally recovers the major architectural features of the protein. The S2 portion, which forms the stable stalk that drives membrane fusion, is predicted to be highly constrained: most mutations there sharply lower grammaticality, implying they would damage protein structure or function. In contrast, regions on the outer surface of the S1 portion, including the N‑terminal domain and the receptor‑binding domain, tolerate many more changes. This matches what is seen in real virus genomes, where these exposed regions accumulate mutations that help the virus attach to cells and evade antibodies while the structural core remains more conserved.
Uncovering hidden teamwork between mutations
Proteins are not just a set of independent sites; one mutation can change how acceptable others are, a phenomenon called epistasis. The researchers probed this by starting from the Omicron BA.1 spike and computationally “reverting” its defining mutations one at a time back to the original Wuhan sequence. Each reversion alters the model’s likelihoods for amino acids at every other position. Large shifts reveal pairs of sites whose fates are linked. Using this approach, the study highlights known hotspots such as positions around 484 and 501 in the receptor‑binding domain, which together shape both immune escape and binding to the ACE2 receptor. It also points to less obvious clusters of residues whose interactions were later confirmed in experimental studies of Omicron’s enhanced growth in human nasal cells, suggesting the model is capturing genuine structural and functional couplings. 
Following viral evolution and spotting outliers
Beyond single mutations, the authors ask whether ESM‑2 can make sense of whole variant sequences as they appeared over time. They embed one spike sequence for each named SARS‑CoV‑2 lineage and place them in a two‑dimensional map using an approach called evo‑velocity, which also infers a dominant direction of change. The resulting layout mirrors the known phylogenetic tree: early lineages cluster together, then branches corresponding to Alpha, Delta, Omicron and recombinant lineages peel off in the correct temporal order. Simple summary statistics such as average grammaticality and semantic distance cleanly separate non‑variant lineages, early variants of concern, and Omicron‑class viruses, showing that the model’s internal representation tracks meaningful evolutionary shifts.
Turning embeddings into an early warning system
To explore practical surveillance, the team introduces a dynamic semantic score: each new spike sequence is compared not just to the original Wuhan strain but to the average of viruses circulating in the previous three months. Applied to dense sequencing data from the United Kingdom, this moving score produces distinct waves as Alpha, Delta and successive Omicron sublineages rise and fall. Sequences that lie one or two standard deviations away from the current average are flagged as potential sequences of concern. Using only these early outliers, the method would have highlighted most World Health Organization variants of concern and several important later offshoots such as JN.1, while also revealing which particular sites in the spike protein are repeatedly altered in emerging lineages.
What this means for future threats
Overall, the study shows that a general‑purpose protein language model, used straight off the shelf, can identify which parts of the SARS‑CoV‑2 spike protein are flexible, which sites are structurally critical, how mutations work together, and how the virus’s spike has wandered through evolutionary space over the course of the pandemic. Because the method works from a single protein sequence and does not rely on pre‑existing alignments or detailed structural data, it could be applied very early in an outbreak, when only a handful of genomes are known. As similar models are refined and tuned to viral datasets, they may become an important part of the toolkit for forecasting how new pathogens evolve and for prioritizing variants for laboratory study and vaccine design.
Citation: Lamb, K.D., Hughes, J., Lytras, S. et al. From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2. Nat Commun 17, 2938 (2026). https://doi.org/10.1038/s41467-026-69569-9
Keywords: protein language models, SARS-CoV-2 spike, viral evolution, epistasis, variant surveillance