Clear Sky Science

Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage

2026-03-14

Storing Tomorrow’s Data in Nature’s Hard Drive

Imagine fitting all the world’s data into something you could hold in your hand. DNA, the molecule that stores genetic information in living things, can in principle hold millions of times more data per gram than today’s hard drives. But turning movies, photos, and archives into strands of DNA and reading them back perfectly is tricky. This study asks a practical question: with today’s technology and software, how close are we to using DNA as a serious data vault, and which digital “translation” methods work best?

How Digital Files Become DNA

To save data in DNA, you must convert ordinary computer bits into a sequence of the four DNA building blocks (A, C, G, and T), then have that sequence manufactured in the lab, stored, and later read back with DNA sequencers. Along the way, many things can go wrong: some DNA strands disappear entirely, others pick up extra or missing letters, and still others are copied unevenly so that some sequences are over‑represented while others are rare. To cope with this, researchers design encoder–decoder programs, called codecs, that add redundancy: extra information that lets you fix errors and recover missing pieces. The authors selected six well‑known codecs from the literature and standardized them so they could be compared fairly under the same conditions.
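To make the bits-to-bases step concrete, here is a minimal sketch in Python. It assumes the simplest possible mapping, two bits per base, and adds no redundancy at all. The mapping table and the function names are illustrative assumptions and are not taken from any of the six codecs in the study, which layer error correction on top of a step like this.

```python
# Hypothetical 2-bit-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn raw bytes into a string of DNA letters, two bits per letter."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(); assumes the strand is error-free and a multiple of 4 letters."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)           # CAGACGGC
    print(decode(strand))   # b'Hi'
```

A mapping this naive breaks as soon as a single letter is lost or changed, which is exactly the gap the redundancy added by a codec is meant to close.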

Testing DNA Memory in the Computer

The team first ran exhaustive computer simulations to probe how well each codec could survive different kinds of damage. They simulated millions of short DNA fragments, each carrying a piece of a test file, then randomly added substitutions, missing letters, extra letters, or even removed entire sequences. By repeating these experiments many times, they determined the highest error and loss rates at which each codec could still recover the file with high reliability. A key step was “clustering” the many noisy copies of each DNA strand and merging them into a cleaner consensus sequence before decoding. This simple trick roughly doubled error tolerance and also sped up decoding, because the codecs had fewer, higher‑quality sequences to process.
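The consensus idea can be sketched as a per-position majority vote. This example assumes the reads have already been grouped into one cluster and all have the same length, so it only handles substitution errors. Real pipelines must also align reads to cope with missing and extra letters, and the study's own clustering method is not described here. The function name is hypothetical.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Majority vote at each position across equal-length reads of one cluster."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch only handles equal-length reads")
    # zip(*reads) yields one column of letters per position;
    # ties go to the letter that appears first in that column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


if __name__ == "__main__":
    noisy_copies = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC"]
    print(consensus(noisy_copies))  # ACGTAC
```

Each noisy copy carries a different mistake, so the vote cancels them out, and the decoder then sees one clean sequence instead of four flawed ones.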

From Clean Labs to Messy Reality

Real DNA storage systems differ widely in how they synthesize and handle DNA, so the authors modeled two practical workflows. A “high‑fidelity” path used a modern commercial DNA printer and accurate copying enzymes, producing low error rates and little strand loss. A “low‑fidelity” path used a cheaper, more error‑prone synthesis method and a rougher copying step, which introduced more mistakes and missing strands. Within each path they varied how many physical DNA copies were stored and how deeply the pool was sequenced, revealing a trade‑off between storage density, sequencing cost, and reliability. Some codecs handled random letter errors very well but failed when many entire strands were missing; others were better balanced. Three approaches—DNA‑Aeon, DNA‑RS, and a graph‑based method called DBGPS (tested in silico)—emerged as the most robust across both error types.
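One way to build intuition for the depth-versus-reliability trade-off is a back-of-the-envelope model. The sketch below assumes that the number of reads per strand follows a Poisson distribution with a given mean coverage. This is a simplifying assumption of the example, not the model used in the study. Under it, the fraction of strands that are never read is exp(-coverage).

```python
import math


def expected_dropout(mean_coverage: float) -> float:
    """Fraction of strands with zero reads, assuming Poisson-distributed reads per strand."""
    if mean_coverage < 0:
        raise ValueError("mean coverage cannot be negative")
    return math.exp(-mean_coverage)


if __name__ == "__main__":
    for coverage in (1, 3, 5, 10):
        print(f"mean coverage {coverage:>2}: {expected_dropout(coverage):.4%} of strands unseen")
```

Because uneven copying makes some strands much rarer than others, real pools tend to lose more strands than this idealized model predicts at the same sequencing depth.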

Pushing DNA Storage Toward Its Limits

To check that the simulations matched reality, the researchers carried out lab experiments following both the high‑ and low‑fidelity workflows, using two commercial DNA synthesis technologies. They encoded small image files into over 11,000 DNA sequences with all six codecs, then amplified, diluted, and re‑sequenced the pools. After artificially limiting the sequencing depth to reflect realistic read budgets, they tested whether the original files could still be decoded. The best codecs successfully recovered data at storage densities of about 43 exabytes (an exabyte is a billion gigabytes) per gram of DNA with the high‑quality workflow, and about 13 exabytes per gram with the low‑quality workflow. Both figures are substantially higher than previous experimental records and within roughly an order of magnitude of the theoretical limit.
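For a sense of where such a theoretical limit comes from, the arithmetic can be sketched as follows. The example assumes single-stranded DNA, an average mass of about 330 grams per mole of nucleotides, two bits per nucleotide, a single physical copy of each strand, and no redundancy. These are common textbook assumptions made for this sketch, and the study may define its limit differently.

```python
AVOGADRO = 6.02214076e23      # molecules per mole (exact SI value)
NT_MASS_G_PER_MOL = 330.0     # assumption: approximate average mass of one ssDNA nucleotide
BITS_PER_NT = 2.0             # four letters, so at most two bits per nucleotide


def max_density_exabytes_per_gram(
    nt_mass_g_per_mol: float = NT_MASS_G_PER_MOL,
    bits_per_nt: float = BITS_PER_NT,
) -> float:
    """Upper bound on stored exabytes per gram under the stated assumptions."""
    nucleotides_per_gram = AVOGADRO / nt_mass_g_per_mol
    bytes_per_gram = nucleotides_per_gram * bits_per_nt / 8
    return bytes_per_gram / 1e18  # 1 exabyte = 10**18 bytes


if __name__ == "__main__":
    limit = max_density_exabytes_per_gram()
    print(f"theoretical limit: about {limit:.0f} EB per gram")
    print(f"ratio to 43 EB per gram: about {limit / 43:.1f}x")
```

Under these assumptions the bound comes out at a few hundred exabytes per gram, which is consistent with the reported 43 exabytes per gram sitting roughly one order of magnitude below it.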

What This Means for Future DNA Archives

The study shows that today’s error‑correction methods for DNA data storage are already surprisingly mature. With carefully chosen codecs and workflows, it is possible to store data at extreme densities while tolerating significant errors and strand loss. It also highlights that simple tests, such as only counting how many extra bits a codec adds or running toy error simulations, can be misleading; realistic benchmarks must consider both missing strands and letter‑level errors, and should compare against proven state‑of‑the‑art methods. For non‑experts, the message is clear: DNA is no longer just a futuristic idea for storing information. The software machinery needed to read and write reliable DNA archives is in place, and further progress will come from refining lab methods and scaling up, rather than inventing entirely new codes.

Citation: Gimpel, A.L., Remschak, A., Stark, W.J. et al. Comparison of state-of-the-art error-correction coding for sequence-based DNA data storage. Nat Commun 17, 3963 (2026). https://doi.org/10.1038/s41467-026-70548-3

Keywords: DNA data storage, error correction, data density, coding theory, synthetic biology