Clear Sky Science
A transparent AI assurance and benchmarking framework for EEG seizure detection on TUSZ seeded with a reproducible gradient-boosting ensemble
Why smarter seizure alarms matter
For people with epilepsy, doctors often rely on long recordings of brain activity to find seizures hidden in hours of otherwise normal patterns. Manually sifting through these electroencephalography (EEG) traces is slow, exhausting work, and automated seizure alarms could help — but only if they are trustworthy. This study introduces a transparent way to test and compare seizure-detection algorithms on a major public EEG database, and showcases a strong, carefully evaluated model built to meet realistic clinical constraints on missed events and false alarms.
Turning messy brain waves into a fair test bed
The authors focus on the Temple University Hospital EEG Seizure Corpus, a widely used collection of real-world scalp EEG recordings with expert-marked seizures. Although this dataset was designed with clear training and testing splits, many published studies have quietly bent those rules: mixing patients across partitions, using seizure-only clips, or judging performance on short segments instead of entire recordings. These choices can make algorithms look better than they truly are and prevent fair comparison. In response, the team defines an explicit, open protocol: a fixed split into training, development, and evaluation sets that never share patients; a clear rule for labeling one-minute windows as seizure or non-seizure; and a broad set of performance measures that reflect what clinicians actually care about, including how many false alarms occur per hour of monitoring.
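The protocol ingredients described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the 50% overlap rule for labeling a window is a placeholder, since this summary does not spell out the exact labeling rule.

```python
def assert_patient_disjoint(splits):
    """Raise ValueError if any patient ID appears in more than one partition."""
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a]) & set(splits[b])
            if shared:
                raise ValueError(f"{a} and {b} share patients: {sorted(shared)}")


def label_window(win_start, win_end, seizures, min_overlap_frac=0.5):
    """Return 1 if annotated seizure time fills enough of the window, else 0.

    ASSUMPTION: the 0.5 default is illustrative, not the paper's rule.
    Seizure intervals are (start_s, end_s) pairs assumed not to overlap.
    """
    overlap = sum(
        max(0.0, min(win_end, s_end) - max(win_start, s_start))
        for s_start, s_end in seizures
    )
    return int(overlap >= min_overlap_frac * (win_end - win_start))


def false_alarms_per_hour(n_false_alarms, recording_seconds):
    """Alarm burden normalized by monitoring time."""
    return n_false_alarms / (recording_seconds / 3600.0)


# Hypothetical patient IDs and annotations, for illustration only.
splits = {"train": ["p01", "p02"], "dev": ["p03"], "eval": ["p04"]}
assert_patient_disjoint(splits)            # passes silently
first = label_window(0, 60, [(40, 100)])   # 20 s of overlap in a 60 s window
second = label_window(60, 120, [(40, 100)])  # 40 s of overlap
rate = false_alarms_per_hour(17, 25 * 3600)
```

Checking the split in code rather than by convention is the point: a benchmark only stays comparable if patient leakage fails loudly.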

A three-part AI to read EEG like a screening tool
Rather than deploy a deep neural network as a black box, the researchers build an interpretable system based on gradient-boosting decision trees. Every 60-second window of EEG, sliding forward in 15-second steps, is transformed into a rich collection of hand-crafted features. These capture how strong different brain rhythms are, how their shapes change over time, how synchronized activity is across regions, and how jagged or smooth the waves appear. On top of this, the model adds temporal context: for each window, it summarizes how those features evolve across neighboring windows, mimicking how a human reader judges patterns over time. Three related ensembles — a basic model, a full-context model, and a version tuned for extra sensitivity — each make predictions, which are then averaged into a single seizure probability for every window.
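A minimal sketch of the windowing, temporal-context, and averaging steps follows. The 60-second window and 15-second step come from the article; everything else (the function names, the neighborhood radius, the particular context statistics, and equal ensemble weights) is an assumption made for illustration.

```python
from statistics import mean, pstdev

WINDOW_S = 60  # window length in seconds (from the article)
STEP_S = 15    # stride in seconds (from the article)


def window_starts(recording_s, window_s=WINDOW_S, step_s=STEP_S):
    """Start times of every full window that fits inside the recording."""
    return list(range(0, recording_s - window_s + 1, step_s))


def add_temporal_context(values, radius=2):
    """Summarize one feature across neighboring windows.

    Returns, per window: (value, neighborhood mean, neighborhood std,
    change from the previous window). ASSUMPTION: radius and the choice
    of statistics are hypothetical.
    """
    out = []
    for i, v in enumerate(values):
        hood = values[max(0, i - radius):i + radius + 1]
        prev = values[i - 1] if i > 0 else v
        out.append((v, mean(hood), pstdev(hood), v - prev))
    return out


def average_ensemble(*model_probs):
    """Equal-weight average of per-window probabilities from several models."""
    return [mean(p) for p in zip(*model_probs)]


starts = window_starts(120)
ctx = add_temporal_context([1.0, 2.0, 6.0], radius=1)
# Three hypothetical models scoring the same two windows.
avg = average_ensemble([0.2, 0.9], [0.4, 0.7], [0.6, 0.8])
```

With a 15-second step, consecutive 60-second windows share 45 seconds of signal, so neighboring scores are strongly correlated. That is one reason the context features and later smoothing are natural companions.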
From raw scores to clinically realistic alarms
Simply ranking windows from most to least seizure-like is not enough; what matters in practice is how many seizures are caught for an acceptable number of alarms. The authors therefore treat threshold selection as an “alarm-budget” problem. On the development set, they jointly tune the decision threshold and a post-processing pipeline that smooths predictions over time, fills small gaps, merges nearby detections, and discards very short blips. Only parameter combinations that keep window-level specificity high and hold false alarms at or below roughly two-thirds of an alarm per hour of recording are considered. Among those, they choose the one that catches the most seizure events and then lock this policy before ever looking at the held-out evaluation set. This careful separation guards against overfitting and mirrors how a tool would be configured before deployment.
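The alarm-budget idea can be sketched as a small constrained grid search. This is an illustration under assumptions, not the authors' pipeline: the parameter grid, function names, and toy data are hypothetical, each flagged window is treated as covering one 15-second step, gap filling and merging are collapsed into a single step, and the window-level specificity constraint is omitted for brevity.

```python
from itertools import product

STEP_S = 15  # seconds between consecutive window scores (from the article)


def smooth(probs, k):
    """Centered moving average over up to k windows (k odd); edges use fewer."""
    h = k // 2
    out = []
    for i in range(len(probs)):
        hood = probs[max(0, i - h):i + h + 1]
        out.append(sum(hood) / len(hood))
    return out


def to_events(probs, thr, k=3, merge_gap_s=30, min_dur_s=30):
    """Smooth, threshold, merge detections across short gaps, drop short ones."""
    flags = [p >= thr for p in smooth(probs, k)]
    events, start = [], None
    for i, flagged in enumerate(flags + [False]):  # sentinel closes a trailing run
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            events.append([start * STEP_S, i * STEP_S])
            start = None
    merged = []
    for ev in events:
        if merged and ev[0] - merged[-1][1] <= merge_gap_s:
            merged[-1][1] = ev[1]  # extend the previous event across the gap
        else:
            merged.append(ev)
    return [(s, e) for s, e in merged if e - s >= min_dur_s]


def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def score(detections, truth, hours):
    """Event sensitivity (any overlap counts as a hit) and false alarms per hour."""
    hits = sum(any(overlaps(t, d) for d in detections) for t in truth)
    fas = sum(not any(overlaps(d, t) for t in truth) for d in detections)
    return (hits / len(truth) if truth else 0.0), fas / hours


def pick_operating_point(probs, truth, hours, budget_fa_per_h=2 / 3):
    """Among settings within the alarm budget, keep the most sensitive one."""
    best = None
    for thr, k, gap, dur in product([0.3, 0.5, 0.7], [1, 3, 5], [0, 30], [15, 30]):
        sens, fa_h = score(to_events(probs, thr, k, gap, dur), truth, hours)
        if fa_h > budget_fa_per_h:
            continue  # over budget: not eligible, however sensitive
        if best is None or sens > best[0]:
            params = {"thr": thr, "k": k, "merge_gap_s": gap, "min_dur_s": dur}
            best = (sens, fa_h, params)
    return best


# Toy development recording: one seizure at 150-300 s plus a one-window blip.
probs = [0.1] * 40
for i in range(10, 20):
    probs[i] = 0.9
probs[30] = 0.9
truth = [(150, 300)]
best = pick_operating_point(probs, truth, hours=len(probs) * STEP_S / 3600)
```

In this toy case the first setting keeps the blip as a false alarm and is rejected; the next one discards it through the minimum-duration rule and is selected. The chosen parameters would then be frozen before touching any evaluation data.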

How well the system works — and where it struggles
Tested under these strict rules, the model distinguishes seizure from non-seizure windows reliably even though seizures are rare in the data. On the evaluation set, it achieves strong discrimination scores and, at the chosen operating point, correctly identifies about three quarters of seizure events while generating roughly 0.68 false alarms per hour of EEG, a burden similar to that of commercial hospital systems. Importantly, the detector also covers about three quarters of the total seizure duration, turning the clinician’s task from searching a haystack into reviewing a shorter, high-yield list of candidate periods. Yet performance is not uniform: shorter seizures are much harder to detect, some patients experience many more false alarms than others, and some missed events show subtle or focal patterns that the current hand-crafted features may underrepresent.
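Two of the quantities mentioned here, duration coverage and sensitivity broken down by seizure length, are simple interval arithmetic. The sketch below assumes intervals are (start, end) pairs in seconds that do not overlap within each list; the function names, the 30-second cut-off between "short" and "long", and the example intervals are hypothetical and are not the study's data.

```python
def covered_seconds(seizure, detections):
    """Seconds of one seizure interval that fall inside any detection."""
    s, e = seizure
    return sum(max(0, min(e, d_end) - max(s, d_start)) for d_start, d_end in detections)


def duration_coverage(seizures, detections):
    """Fraction of total annotated seizure time covered by detections."""
    total = sum(e - s for s, e in seizures)
    if total == 0:
        return 0.0
    return sum(covered_seconds(z, detections) for z in seizures) / total


def sensitivity_by_length(seizures, detections, short_below_s=30):
    """Event sensitivity for short and long seizures separately.

    ASSUMPTION: the 30 s boundary is illustrative only.
    """
    groups = {"short": [], "long": []}
    for z in seizures:
        key = "short" if z[1] - z[0] < short_below_s else "long"
        groups[key].append(covered_seconds(z, detections) > 0)
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Made-up annotations and detections.
seizures = [(100, 200), (400, 420), (600, 700)]
detections = [(120, 200), (610, 680), (900, 930)]
coverage = duration_coverage(seizures, detections)   # 150 of 220 seconds
by_length = sensitivity_by_length(seizures, detections)
```

Reporting the breakdown alongside the headline number makes uneven performance visible instead of averaging it away.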
Seeing inside the model’s decision-making
Because the system relies on explicit features rather than opaque raw-wave filters, the authors can ask which properties of the EEG most influence its choices. Using model-interpretation tools, they find that changes in the main background rhythm, bursts of activity in slower bands, fluctuations in the strength of alpha waves, and increased waveform sharpness all play major roles — broadly in line with how clinicians recognize seizures. They also document typical mistakes: false alarms often coincide with movement or electrode artifacts that mimic seizure-like sharp transients, while misses frequently involve confined, slower rhythms that blend into the background. This kind of transparent analysis helps build confidence in what the model has learned and highlights concrete avenues for refinement.
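One generic way to ask which features a model depends on is permutation importance: shuffle one feature at a time and measure how much accuracy drops. This summary does not say which interpretation tools the authors used, so the sketch below is a stand-in technique and not their method. The toy model, the feature names, and the data are all hypothetical.

```python
import random


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy setup: column 0 plays the role of "sharpness", column 1 is irrelevant.
X = [[i / 20, (i * 7 % 20) / 20] for i in range(20)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    """Stand-in classifier that looks only at the first feature."""
    return int(row[0] > 0.5)


importance = permutation_importance(toy_model, X, y)
```

Because the toy model ignores the second column, shuffling it leaves accuracy unchanged and its importance is exactly zero, while shuffling the first column degrades accuracy.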
What this means for future seizure detectors
The work’s central message is that meaningful progress in automated seizure detection depends as much on honest evaluation as on novel algorithms. By anchoring a patient-separated benchmark, fixing how alarms are derived from scores, and openly reporting trade-offs between seizure coverage and false alarms, the authors provide a reference point that future methods can fairly match or surpass. Their gradient-boosting system, while not perfect, shows that a thoughtfully engineered, interpretable model can deliver clinically relevant performance under realistic alarm budgets, and that transparent “AI assurance” — not just accuracy headlines — should guide the path from lab prototypes to bedside tools.
Citation: Zabihi, M., Gilmore, E.J., Ding, K. et al. A transparent AI assurance and benchmarking framework for EEG seizure detection on TUSZ seeded with a reproducible gradient-boosting ensemble. Sci Rep 16, 11283 (2026). https://doi.org/10.1038/s41598-026-41358-w
Keywords: EEG seizure detection, epilepsy monitoring, clinical AI benchmarking, machine learning in neurology, alarm burden in healthcare