Clear Sky Science · en

A genomic approach for accurate identification of closely related species with next-generation sequencing samples

· Back to index

Why this matters for farms and beyond

Modern DNA sequencing can read the genetic code of animals in astonishing detail, but even powerful computers can struggle with a surprisingly basic question: are these sequences from a sheep or a goat? For farmers, breeders, conservationists, and scientists, mixing up species in large DNA datasets can derail studies of health, productivity, and evolution. This paper introduces a simple but clever way to tell apart closely related species—demonstrated on sheep and goats—by looking not at every tiny difference in their DNA, but at a handful of stretches that act like species-specific barcodes.

Figure 1
Figure 1.

The problem with look-alike DNA

Sheep and goats share much of their genetic blueprint, so short DNA snippets from one often fit almost as well onto the other’s reference genome. The authors analyzed whole-genome sequencing data from 40 animals with known identities—20 sheep and 20 goats—each with hundreds of millions of DNA reads. Using standard tools that match reads to reference genomes, they found that both species’ DNA aligned extremely well to both the sheep and goat references. Alignment rates, coverage depth, and error measures were all very similar and showed heavy overlap, making it nearly impossible to say with confidence which species a sample came from based on these routine statistics alone.

Why standard DNA classifiers fall short

The team also tested Kraken2, a popular program that tries to assign each DNA read to a place in the tree of life. Even with a comprehensive database, reads from both sheep and goats were mostly classified into the same broad animal groups, with only slight numerical differences between them. Visualizations of these assignments showed that most reads from both species funneled into the same genera, reflecting how much of their DNA they share with each other and with other mammals. In practice, these blurred boundaries mean that traditional taxonomic tools can mislead researchers who assume that a labeled “sheep” dataset is really from sheep, or that a mislabeled sample will be easy to spot.

Turning missing coverage into a species barcode

Instead of asking how well DNA reads match a reference, the authors flipped the question: where do they not match? They aligned the training set of 30 animals (15 sheep, 15 goats) to both reference genomes and scanned for regions with a clear on–off pattern. A region counted as “goat-specific,” for example, if goat samples consistently showed normal coverage there when aligned to the goat genome, while sheep samples showed almost no coverage at the same position. Using strict cutoffs, this search produced more than 150,000 candidate regions in goats and over 1.7 million in sheep. After manual review focusing on longer, cleanly separated stretches, the team distilled this down to just ten high-confidence regions per species—short DNA zones where one species reliably “lights up” and the other stays dark.

Figure 2
Figure 2.

A simple test for unknown samples

With these 20 regions in hand, the authors designed a straightforward testing routine for any unlabeled DNA dataset. First, align the reads to both the sheep and goat reference genomes. Then, measure how much coverage—the pile-up of reads—falls inside the ten sheep-specific regions on the sheep genome and the ten goat-specific regions on the goat genome. If the sheep regions show strong coverage while the goat regions are nearly empty, the sample is a sheep; if the pattern is reversed, it is a goat. Applied to 14 independent validation samples, including publicly available data from different sequencing machines and even chemically modified DNA, this pattern-based test correctly identified every single sample, achieving 100% accuracy in the studied set.

New tools and future uses

Beyond solving a practical problem for sheep and goat research, this work offers a general blueprint that could be adapted to other pairs—or groups—of closely related species. The curated regions serve as building blocks for future tools, from quick lab tests that amplify only those species-specific stretches, to automated software that screens old sequencing datasets for mislabeling. Although the method does require aligning data to multiple reference genomes, which costs computing time and storage, it sidesteps many pitfalls of traditional approaches and is robust to differences in breeds and sequencing platforms. In everyday terms, the authors have shown how a tiny number of carefully chosen DNA landmarks can give a clear, reliable answer to a question that big, complex algorithms often get wrong: which animal is this?

Citation: dain Marzouka, N.a., Al-Aamri, A., Alshamsi, F. et al. A genomic approach for accurate identification of closely related species with next-generation sequencing samples. Sci Rep 16, 11329 (2026). https://doi.org/10.1038/s41598-026-41497-0

Keywords: species identification, whole genome sequencing, sheep and goats, comparative genomics, animal genetics