Clear Sky Science

Integrative analysis of in silico predictions and clinical evidence to delineate the capability of HiFi long-read sequencing in paralogous genes

2026-03-03

Why untangling look‑alike genes matters

Our DNA contains many pairs or families of "twin" genes that look almost identical. These copy‑and‑paste stretches of code are important for health and disease, but they are notoriously hard to read correctly with standard genome tests. This study asks a practical question for medicine: how far can today’s long‑read DNA sequencing really go in separating these confusing gene copies, where does it still fail, and can smart software help close the remaining gaps?

When gene copies fool standard DNA tests

Modern genetic testing often relies on short snippets of DNA, only a few hundred letters long. When these short reads come from regions where gene copies are almost identical, a computer cannot easily tell which copy they belong to. This can blur or hide important disease‑related changes. To capture this problem in a way that does not depend on any single patient or machine, the authors used a concept called "mappability": for a chosen read length, they asked how often each stretch of that length appears in the reference genome. If a stretch appears only once, a read covering it can be placed with confidence. If it could match in several places, that region was marked as hard or impossible to resolve with reads of that length.
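
The mappability idea can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' pipeline: it counts exact matches only, looks at one DNA strand, and uses an invented 28-letter "genome" (`toy_reference`) in which two near-identical copies differ at a single letter. The function names are hypothetical.

```python
from collections import Counter


def kmer_mappability(reference: str, read_length: int) -> list[float]:
    """Score every start position as 1 / (occurrences of the window starting there).

    A score of 1.0 means the window is unique in the reference; lower scores
    mean a read of this length could be placed in more than one location.
    """
    n_windows = len(reference) - read_length + 1
    if read_length <= 0 or n_windows <= 0:
        return []
    windows = [reference[i:i + read_length] for i in range(n_windows)]
    counts = Counter(windows)
    return [1.0 / counts[w] for w in windows]


def ambiguous_fraction(reference: str, read_length: int) -> float:
    """Fraction of start positions whose window occurs more than once."""
    scores = kmer_mappability(reference, read_length)
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < 1.0) / len(scores)


# Invented toy reference: two gene "copies" that differ only at their last
# letter, separated by a short unique spacer.
copy_a = "ACGTACGGTC"
copy_b = "ACGTACGGTA"
toy_reference = copy_a + "TTGCCATG" + copy_b

for length in (4, 10):
    print(length, ambiguous_fraction(toy_reference, length))
```

In this toy, 4-letter reads are ambiguous at nearly half of all positions, while 10-letter reads span the one distinguishing letter and become unique everywhere. The same logic, at a vastly larger scale, is why longer reads rescue many look-alike genes.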

What longer DNA reads can fix—and what they cannot

The team first mapped out which parts of the genome would be troublesome for widely used short‑read sequencing and found that 645 medically important genes fell into this danger zone. They then repeated the calculation for much longer reads, up to 14,000 DNA letters, similar to what cutting‑edge long‑read technology now produces. With these long reads, about two‑thirds of the previously problematic genes were predicted to become clearly readable, but roughly one‑third remained stubbornly unresolved. When they compared these predictions with real clinical long‑read data from 66 people, genes predicted to be "fixable" did, in fact, show high‑confidence mapping much more often than those predicted to stay difficult, confirming that the simulations captured real‑world behavior.

Testing real‑world accuracy, not just theory

The researchers went beyond mapping scores and looked directly at how well genetic variants could be detected in the tricky regions. Using a well‑studied reference genome, they showed that long‑read sequencing detected a larger share of both single‑letter changes and small insertions and deletions than short‑read methods in areas known to be hard to map. Short reads missed many true changes in these regions, while long reads picked up nearly all of them, although they still produced somewhat more uncertain calls than in easy parts of the genome. By modeling how the number of unsolved genes shrinks as read length grows, they found a curve that drops quickly up to about 7,000–8,000 letters and then flattens out, suggesting that simply making reads even longer will not remove all blind spots.
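
"Missing true changes" can be expressed as recall: the share of known variants inside the hard regions that a method actually reports. The sketch below is only an illustration of that arithmetic. The function name, the region coordinates, and every variant are invented for the example and are not data from the study.

```python
def recall_in_regions(truth, calls, regions):
    """Recall of `calls` against `truth`, restricted to the given regions.

    truth, calls: sets of (chrom, pos, ref, alt) tuples.
    regions: list of (chrom, start, end) with half-open coordinates.
    Returns None when no truth variant falls inside the regions.
    """
    def in_regions(variant):
        chrom, pos = variant[0], variant[1]
        return any(c == chrom and start <= pos < end for c, start, end in regions)

    truth_in_regions = {v for v in truth if in_regions(v)}
    if not truth_in_regions:
        return None
    found = truth_in_regions & calls
    return len(found) / len(truth_in_regions)


# Invented toy data: one hard-to-map region holding four known variants,
# plus one known variant in an easy region.
hard_regions = [("chr1", 1000, 2000)]
truth = {
    ("chr1", 1100, "A", "G"),
    ("chr1", 1300, "C", "T"),
    ("chr1", 1500, "G", "GA"),   # small insertion
    ("chr1", 1800, "TC", "T"),   # small deletion
    ("chr1", 5000, "A", "C"),    # outside the hard region
}
short_calls = {("chr1", 1100, "A", "G"), ("chr1", 5000, "A", "C")}
long_calls = set(truth)

print("short reads:", recall_in_regions(truth, short_calls, hard_regions))
print("long reads:", recall_in_regions(truth, long_calls, hard_regions))
```

Restricting the comparison to hard regions matters: a method can look excellent genome-wide while still missing most variants in exactly the places where look-alike genes live.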

Using smart software to separate tangled gene copies

To tackle the genes that stayed confusing even with long reads, the authors turned to a specialized phasing tool called Paraphase. Instead of looking at each read in isolation, this software groups reads into distinct "haplotypes"—coherent versions of each gene copy—by re‑aligning them to a common pattern and tracking how sequence changes travel together. Applied to 79 groups of highly similar genes that were predicted or observed to be difficult, Paraphase was able to reconstruct nearly complete, clean haplotypes for over three‑quarters of them. In detailed examples, such as a gene pair involved in hearing, reads that were previously jumbled between copies could be neatly assigned to separate tracks, illustrating how algorithmic insight can overcome limits of raw read length.
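
The idea of tracking how sequence changes travel together can be shown with a deliberately simple sketch. This is not Paraphase's actual algorithm or API. It is a greedy toy in which each read is reduced to the letters it shows at a few distinguishing positions, and a read joins the first group it overlaps with and never contradicts. The read names, positions, and function name are all hypothetical.

```python
def group_reads_into_haplotypes(reads):
    """Greedily group reads whose alleles agree at every shared position.

    reads: list of (read_name, {position: base}) pairs.
    Returns a list of groups, each a dict with the merged "alleles" seen so
    far and the "reads" assigned to that group. The result depends on read
    order, which a real phasing tool would handle far more carefully.
    """
    haplotypes = []
    for name, alleles in reads:
        for hap in haplotypes:
            shared = alleles.keys() & hap["alleles"].keys()
            # Join only if the read overlaps this group and never disagrees.
            if shared and all(alleles[p] == hap["alleles"][p] for p in shared):
                hap["alleles"].update(alleles)
                hap["reads"].append(name)
                break
        else:
            # No compatible group found: this read starts a new haplotype.
            haplotypes.append({"alleles": dict(alleles), "reads": [name]})
    return haplotypes


# Invented toy reads covering three distinguishing positions.
toy_reads = [
    ("r1", {100: "A", 250: "G"}),
    ("r2", {100: "C", 250: "T"}),
    ("r3", {250: "G", 400: "T"}),
    ("r4", {250: "T", 400: "C"}),
    ("r5", {100: "A", 400: "T"}),
]

for hap in group_reads_into_haplotypes(toy_reads):
    print(hap["reads"], hap["alleles"])
```

Here no single read covers every position, yet overlapping reads chain together into two consistent versions of the gene. That chaining, rather than raw read length alone, is what lets software pull apart copies that still look tangled.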

What this means for future genetic diagnosis

For non‑specialists, the main message is that longer DNA reads already make a big difference for genes that have confusing twins, and they clearly outperform traditional short‑read tests in many medically important regions. Yet even the best current long‑read technologies cannot fully solve every tangled gene family, because some stretches of the genome are simply too repetitive. This study shows that combining three elements—careful computer predictions of where trouble is likely, real‑world clinical long‑read data, and dedicated phasing software—provides a practical roadmap for which genes can be trusted, which need extra caution, and where new methods are most urgently needed. In clinical genomics, that kind of clear boundary‑setting is essential for turning ever‑better sequencing into truly reliable diagnoses.

Citation: Kim, S.K., Jang, J., Kim, Y. et al. Integrative analysis of in silico predictions and clinical evidence to delineate the capability of HiFi long-read sequencing in paralogous genes. npj Genom. Med. 11, 21 (2026). https://doi.org/10.1038/s41525-026-00555-2

Keywords: long-read sequencing, paralogous genes, clinical genomics, genome mappability, haplotype phasing