Clear Sky Science

High-resolution phage-host assignment through key proteins using large language models

Hunting Invisible Viruses in Our Gut

Every person carries trillions of bacteria and their viruses in the gut, many of which are still unknown. These hidden viruses may shape our health, from digestion to obesity, yet scientists often do not know which virus infects which bacterium. This study introduces VirHost Hunter, a new data-driven tool that links gut viruses to their bacterial hosts using only a few key viral proteins, opening the door to more precise ways to study and potentially steer the microbiome.

A New Way to Match Viruses and Bacteria

Traditional methods for pairing viruses with their bacterial hosts rely on full viral genomes or special genetic clues such as CRISPR markers. These approaches work only when the right reference data exist and can miss a large fraction of viral sequences, often called viral dark matter. The authors instead focus on two types of viral proteins that are central to infection: tail proteins, which help a virus recognize and attach to a bacterium, and lysins, which help it break open the bacterial cell wall. By concentrating on these proteins, they avoid the noise of unrelated genes and can work even when only fragments of a viral genome are available.

Figure 1. How key viral proteins help match hidden gut viruses to their bacterial hosts and reshape our view of the microbiome.

Teaching Computers the Language of Proteins and DNA

To read meaning in these proteins, the team turns to machine learning techniques originally developed for human language. They use a protein language model called ProtT5 to convert amino acid sequences into dense numerical patterns that capture hidden functional similarities, even when sequences look very different at first glance. In parallel, they analyze the DNA that encodes these proteins using a Vision Transformer model and a multi-path convolutional network, which together pick up features such as typical codon usage and long-range patterns along the DNA. These protein and DNA signals are then merged and fed into a pair of classifiers that jointly decide which bacterial family, genus, or species a given virus is likely to infect.
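The idea of merging a protein signal with a DNA signal can be illustrated with a deliberately simplified sketch. It does not reproduce the authors' pipeline: ProtT5 embeddings are replaced here by a plain amino acid composition vector, and the Vision Transformer and convolutional DNA features are replaced by codon usage frequencies. The function names (`protein_features`, `codon_features`, `fuse`) are hypothetical and exist only for this illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                        # 20 standard residues
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]    # all 64 codons


def protein_features(protein_seq):
    """Stand-in for a learned protein embedding: amino acid frequencies."""
    counts = Counter(protein_seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1   # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def codon_features(dna_seq):
    """Stand-in for learned DNA features: codon usage frequencies."""
    dna = dna_seq.upper()
    usable = len(dna) - len(dna) % 3                   # drop a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts[c] for c in CODONS) or 1
    return [counts[c] / total for c in CODONS]


def fuse(protein_seq, dna_seq):
    """Concatenate both views into one vector that a classifier could consume."""
    return protein_features(protein_seq) + codon_features(dna_seq)


# Toy tail-protein fragment and the DNA that encodes it (illustrative only)
vector = fuse("MKT", "ATGAAAACC")
print(len(vector))
```

Concatenation is the simplest way to combine two feature views. The real model uses much richer learned representations, but the principle of handing the classifier both the protein view and the DNA view at once is the same.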

Sharper and Deeper Host Predictions

The researchers tested VirHost Hunter on several benchmark collections of bacteriophages. They show that combining protein and DNA information clearly outperforms using either alone, and that concentrating on tail proteins and lysins gives better predictions than using other viral components such as head proteins or packaging enzymes. Across different levels of bacterial classification, VirHost Hunter is more accurate than existing alignment-free tools and remains reliable even when viruses share only low sequence similarity. When evaluated on cultivated gut phages with experimentally known hosts, it identifies correct hosts with higher precision than a standard CRISPR-based method, and using both approaches together improves results even further.

Revealing Hidden Gut Viruses Linked to Disease

Armed with the calibrated model, the team applied VirHost Hunter to a large human Gut Phage Database that previously had host information for fewer than one third of its entries. By scanning tail and lysin proteins, they nearly doubled the share of phages with assigned hosts and uncovered viruses that target 29 families of gut bacteria, many tied to chronic conditions such as inflammatory bowel disease, heart disease, and obesity. Notably, they found dozens of previously uncharacterized phages predicted to infect bacteria such as Akkermansia muciniphila and Prevotella copri, which have been implicated in autoimmune and metabolic disorders but lacked known phages.

Figure 2. Step-by-step view of a gut virus using tail proteins and lysins to recognize and break open a specific target bacterium.

From Digital Predictions to a Targeted Antimicrobial

To turn these predictions into a practical resource, the authors built a Gut Phage Lysin Database containing more than one hundred thousand lysins with mapped gut hosts. They examined their structures, stability, and diversity, revealing many distinct clusters and conserved motifs responsible for breaking bacterial cell walls. As a proof of concept, they selected one lysin predicted to specifically target Megamonas, a bacterium associated with obesity. After synthesizing this protein, they showed in lab tests that it efficiently kills Megamonas while sparing other common gut microbes and probiotic strains, illustrating how model-guided mining of viral dark matter can yield highly selective tools.

Why This Matters for Future Microbiome Care

This work shows that it is possible to link vast numbers of unknown gut viruses to their bacterial hosts using just a few key proteins and modern machine learning. By illuminating who infects whom in the microbiome, VirHost Hunter boosts our ability to explore gut viral diversity and to design precise interventions, such as tailored lysins, that selectively curb harmful bacteria without disturbing the broader microbial community. While further testing and engineering are needed before clinical use, the framework provides a powerful roadmap for converting hidden viral sequences into targeted strategies for studying and, one day, tuning our inner ecosystem.

Citation: Du, Z., Li, M., Lin, K. et al. High-resolution phage-host assignment through key proteins using large language models. Nat Commun 17, 4439 (2026). https://doi.org/10.1038/s41467-026-70613-x

Keywords: gut virome, bacteriophages, machine learning, phage lysins, microbiome therapy