Clear Sky Science
The Harvard-Emory ECG Database
Why a Giant Heartbeat Library Matters
The electrical beats of the heart, captured in a simple test called an electrocardiogram (ECG), are among the most common measurements in modern medicine. Yet until now, researchers have had surprisingly few very large, well-organized ECG collections to study. The Harvard-Emory ECG Database (HEEDB) changes that: it brings together millions of ECG recordings from everyday hospital care, along with information about who the patients were and what happened to them. This massive “heartbeat library” could help scientists find early warning signs of heart trouble and build fairer, more accurate computer tools for doctors.

A Massive Collection of Heart Signals
HEEDB is currently the largest open-access collection of standard 12‑lead ECGs, the type used in clinics and emergency rooms worldwide. It contains more than 11.6 million ten‑second recordings from over 2.1 million patients seen at Massachusetts General Hospital in Boston and Emory University Hospital in Atlanta between 1980 and 2022. Many people in the database had several ECGs taken over months or years, providing a timeline of how their heart patterns changed as they aged, became ill, or recovered. By opening this resource to qualified researchers, the team aims to enable population‑scale studies of heart rhythms, their disturbances, and how these patterns relate to health outcomes such as heart failure, dangerous arrhythmias, and sudden death.
Who the Patients Are and How Their Data Are Protected
The database does not just store waveforms; it also includes rich background information for each person. For most patients, researchers can see age, sex, and race, while one hospital also provides details like education level, language, and veteran status. Dates such as birth, ECG recording, last hospital visit, and death are available in a carefully altered form: each patient’s dates are randomly shifted by up to a year, and anyone older than 89 is grouped into a single age bracket. Direct identifiers are removed, and each person is assigned a new code that is consistent across related projects. These steps follow established privacy rules and were approved by ethics boards, with data access controlled by a usage agreement that forbids attempts to “re‑identify” individuals.
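For readers who want to see how this kind of date protection can work, here is a minimal Python sketch. It is an illustration only, not the HEEDB team's actual procedure: the function names, the sample dates, and the choice to shift in either direction by up to 365 days are assumptions. The idea it shows is that a single random offset per patient hides the true dates while keeping the time between that patient's events intact.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 365  # "up to a year"; exact range and direction are assumptions
AGE_CAP = 89          # ages above this are reported as one bracket


def draw_offset(rng):
    """Draw one random offset (in days) to be reused for all of a patient's dates."""
    return timedelta(days=rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS))


def shift_dates(dates, offset):
    """Apply the same offset to every date belonging to one patient."""
    return {name: d + offset for name, d in dates.items()}


def age_label(birth, event):
    """Age in whole years at the event, with everyone above the cap grouped together."""
    had_birthday = (event.month, event.day) >= (birth.month, birth.day)
    years = event.year - birth.year - (0 if had_birthday else 1)
    return f">{AGE_CAP}" if years > AGE_CAP else str(years)


# Hypothetical patient record, invented for this example.
rng = random.Random(0)
patient = {
    "birth": date(1950, 3, 14),
    "ecg": date(2015, 6, 1),
    "last_visit": date(2020, 1, 9),
}
shifted = shift_dates(patient, draw_offset(rng))

# The interval between events survives the shift.
print(shifted["ecg"] - shifted["birth"] == patient["ecg"] - patient["birth"])
print(age_label(patient["birth"], patient["ecg"]))
```

Because every date for a patient moves by the same amount, the gap between two ECGs, or between an ECG and a later diagnosis, is unchanged, which is what matters for most follow-up studies.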
Layers of Medical Meaning on Top of Each Heartbeat
Every ECG in HEEDB is linked to several layers of interpretation. First, there are computer‑generated statements from widely used commercial ECG analysis software, which flags rhythm types and possible problems such as prior heart attacks or abnormal electrical patterns. These labels were regenerated for all recordings using the latest version of the software so that researchers can compare patients over decades in a consistent way. Second, for many ECGs, the database also includes what human physicians wrote when they reviewed the traces at the bedside. Because these notes were typed as free text, the team used natural language processing methods to translate them back into standardized computer codes. They then measured how closely the automated and human interpretations agreed, generally finding strong overlap but also highlighting where the computer and doctor saw things differently.
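One common way to measure agreement between two sets of labels is Cohen's kappa, which discounts the agreement expected by chance. The summary above does not say which statistic the authors used, so the Python sketch below is an assumption chosen for illustration, and the label lists are invented rather than taken from HEEDB.

```python
def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled independently at their own rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: 1 means "atrial fibrillation flagged", 0 means "not flagged".
software = [1, 1, 0, 0, 1, 0, 0, 0]
physician = [1, 0, 0, 0, 1, 0, 0, 1]

kappa = cohen_kappa(software, physician)
print(round(kappa, 3))
```

A kappa of 1 means perfect agreement and 0 means no better than chance. Computing it separately for each diagnostic statement is one way to see where software and physicians tend to diverge.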
Connecting Heart Patterns to Diagnoses and Disease History
Beyond what is visible on each ECG strip, the database links every patient to diagnosis codes drawn from their electronic health records. These codes, from long‑standing international systems (ICD‑9 and ICD‑10), summarize conditions ranging from high blood pressure and diabetes to heart rhythm disorders and lung disease, along with the dates when those diagnoses were made. Some patients have only a few codes, while others have hundreds, reflecting complex medical histories. The most common codes in both hospitals relate to essential hypertension, underlining how widespread high blood pressure is among people who receive ECG testing. Importantly, the authors stress that ECG‑based labels and diagnosis codes capture different aspects of care and may refer to different visits, so researchers must decide carefully how to combine them.

Strengths, Limitations, and How Researchers Can Use It
Because the ECGs were collected during ordinary clinical care using the same brand of equipment, the data are consistent but also contain real‑world imperfections such as noise and missing leads. The authors provide basic quality flags and technical notes but deliberately leave further cleaning and selection to end users, who may have different research goals. They also caution that all recordings come from two large U.S. academic centers using one vendor’s system, so findings might not fully generalize to other regions or devices. Even so, the size of the dataset, the diversity of the patients, and the availability of both automated and physician interpretations make HEEDB a powerful testbed for new algorithms and for studying bias across demographic groups.
What This Means for Future Heart Care
In essence, the Harvard‑Emory ECG Database turns millions of routine heart tests into a shared scientific resource. For a non‑specialist, its value lies in the possibility that patterns hidden in these recordings could reveal who is at risk for serious heart problems long before symptoms appear, and whether current tools work equally well for people of different ages, sexes, and backgrounds. By making carefully de‑identified data broadly available, the project lays the groundwork for more precise, data‑driven cardiology and for computer‑assisted decision tools that are both powerful and fair.
Citation: Koscova, Z., Li, Q., Robichaux, C. et al. The Harvard-Emory ECG Database. Sci Data 13, 516 (2026). https://doi.org/10.1038/s41597-026-06861-9
Keywords: electrocardiogram, cardiovascular disease, medical datasets, machine learning in medicine, heart rhythm