Clear Sky Science
Evaluating large language models for diagnostic reasoning from unstructured clinical narratives in epilepsy
Why this matters for patients and doctors
When someone has a seizure, the way it looks and feels can offer vital clues about what is happening inside the brain. Doctors use these descriptions to decide where in the brain seizures start and which treatments, including surgery, might help. This study asks whether large language models, the same kind of artificial intelligence behind chatbots, can reliably interpret real-world seizure descriptions and support this kind of diagnostic reasoning.

Turning seizure stories into brain clues
The researchers focus on epilepsy, a condition where brief surges of abnormal brain activity cause seizures. In everyday care, clinicians listen carefully to patients and witnesses, noting features such as chewing movements, odd sensations, or violent limb thrashing. These details often point to specific brain areas, like the temporal or frontal lobes. The team built on a large public dataset in which more than 1200 seizure descriptions had already been linked to seven broad brain regions based on surgery results that left patients seizure free, a strong sign that the true seizure source had been removed.

Putting many AI models to the test
Eight different language models were evaluated, including widely used general systems and two models tuned on medical text. Each model received a seizure description and had to output how likely it was that the seizure began in each of the seven brain regions. The researchers examined not only how often the top choice was correct, but also how confident the models seemed, how well that confidence matched reality, and how sensible their written explanations were. They compared the results with a simple baseline that always chose the most common brain region and with two human epilepsy specialists who rated a subset of cases.
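To make the evaluation idea concrete, here is a minimal Python sketch of three of the checks described above: top-choice accuracy, the "always pick the most common region" baseline, and one common way to measure how well confidence matches reality (expected calibration error). The summary does not say which calibration measure the study used, so the function names, the binning scheme, and the region labels "A" and "B" are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def top_choice(probs):
    """Return the region with the highest predicted probability."""
    return max(probs, key=probs.get)


def top1_accuracy(preds, labels):
    """Fraction of cases where the model's top region matches the true region."""
    hits = sum(top_choice(p) == y for p, y in zip(preds, labels))
    return hits / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of a baseline that always predicts the most common region."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


def expected_calibration_error(preds, labels, n_bins=5):
    """Assumed metric: bin cases by top-choice confidence and take the
    size-weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        region = top_choice(p)
        conf = p[region]
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, region == y))
    total = len(labels)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - acc)
    return ece


if __name__ == "__main__":
    # Toy data with two hypothetical regions; the study used seven.
    preds = [
        {"A": 0.85, "B": 0.15},
        {"A": 0.65, "B": 0.35},
        {"A": 0.25, "B": 0.75},
        {"A": 0.95, "B": 0.05},
    ]
    labels = ["A", "B", "B", "A"]
    print(top1_accuracy(preds, labels))
    print(majority_baseline_accuracy(labels))
    print(expected_calibration_error(preds, labels))
```

A model is only useful here if its accuracy clearly beats the majority baseline, and a lower calibration error means its stated confidence is a better guide to how often it is right.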

How prompt wording shapes AI behavior
The way the task was phrased for the models had a major impact. When given only basic instructions, most systems did only slightly better than chance. Performance improved when the models were shown a few example cases, asked to think step by step, or given expert-written examples of clinical reasoning to imitate. The strongest gains came from prompts that encouraged detailed reasoning and from combining multiple independent answers to reach a more stable decision. Under these richer instructions, the best systems approached the accuracy of human clinicians on this specific task, while also becoming more consistent and better calibrated in their confidence.
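The idea of combining multiple independent answers can be sketched in a few lines of Python. This is a hedged illustration only: the summary does not say how the study merged answers, so averaging the probability estimates and renormalizing is an assumed aggregation rule, and `combine_answers` and the region labels "A" and "B" are hypothetical names.

```python
def combine_answers(samples):
    """Average several per-region probability estimates into one distribution.

    `samples` is a list of dicts mapping region name to probability, one dict
    per independent model answer. Regions missing from an answer count as 0.
    """
    if not samples:
        raise ValueError("need at least one answer to combine")
    regions = sorted(set().union(*samples))
    totals = {r: sum(s.get(r, 0.0) for s in samples) for r in regions}
    norm = sum(totals.values())
    if norm == 0:
        raise ValueError("all probabilities are zero")
    return {r: v / norm for r, v in totals.items()}


if __name__ == "__main__":
    # Three hypothetical answers to the same seizure description.
    answers = [
        {"A": 0.6, "B": 0.4},
        {"A": 0.3, "B": 0.7},
        {"A": 0.8, "B": 0.2},
    ]
    combined = combine_answers(answers)
    final_region = max(combined, key=combined.get)
    print(combined, final_region)
```

Averaging dampens the effect of any single outlier answer, which is one plausible reason a combined decision is more stable than a single response.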

Strengths, blind spots, and the human check
A closer look revealed important caveats. Clinical experts reviewed the reasoning produced by the two best models. One of them, GPT-4, more often showed sound understanding of symptoms, accurate use of epilepsy knowledge, and coherent logic. It also tended to cite real scientific papers correctly. The other, Mixtral-8×7B, sometimes reached the right answer for the wrong reasons, misreading symptom details or inventing supporting facts and references. The study also showed that performance depended on how long the seizure description was, which clinical role the model was asked to impersonate, and which language the prompt was written in. Very short or very detailed descriptions worked best, asking the model to act as a specialist improved results, and using non-English prompts could reduce accuracy.

What this means for future care
The authors conclude that large language models can, in a controlled setting, turn unstructured seizure stories into useful estimates of where seizures start in the brain. With carefully designed prompts, their performance can come close to that of experienced clinicians, at least for the narrow task of mapping seizure signs to broad brain regions. At the same time, the models can sound convincing while relying on flawed reasoning or made-up sources. This mix of promise and risk means such systems might one day help triage cases or support early diagnostic thinking, but they must be thoroughly validated, closely supervised, and used alongside, not instead of, human expertise.

Citation: Dani, M., Prakash, M.J., Rosa, F. et al. Evaluating large language models for diagnostic reasoning from unstructured clinical narratives in epilepsy. Commun Med 6, 303 (2026). https://doi.org/10.1038/s43856-026-01653-z
Keywords: epilepsy, seizure semiology, large language models, diagnostic reasoning, clinical AI evaluation