Clear Sky Science
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
Why testing medical AI in French matters
Most people now turn to online tools, sometimes powered by artificial intelligence, for health information. Yet the vast majority of these systems are trained and tested in English, even though millions of patients and doctors work in other languages. This article presents MediQAl, a large collection of French medical exam questions designed to reveal how well today’s AI systems actually understand and reason about medicine in French, and where they still fall short.
A new pool of real medical exam questions
The heart of MediQAl is a trove of 32,603 questions drawn from France’s national medical licensing exams. These high‑stakes tests, written by academic and hospital specialists, are built to mirror real clinical practice: they mix textbook facts with messy, real‑world scenarios in which symptoms unfold over time and important clues may be only implied. French exam style adds extra difficulty for machines: questions are long, sentences are complex, and traps often hinge on negations or exceptions such as “all of the following are true, except…”. By preserving this authentic structure, MediQAl offers a demanding, realistic playground for testing medical AI beyond simplified classroom examples.

Three ways to question an AI doctor
MediQAl is organized into three task types that mirror how doctors are tested. The first and largest group is single‑answer multiple‑choice questions, where only one of five options is correct. The second group allows several correct options, forcing systems to weigh combinations of findings the way a doctor might consider multiple possible complications at once. The third group consists of short, open‑ended questions where the system must generate its own brief answer instead of picking from a list. Every question is tagged as either testing straightforward understanding (recalling or applying known facts) or true reasoning (multi‑step thinking, combining clues, or dealing with uncertainty). This structure lets researchers probe not only what an AI “knows,” but how it thinks through a case.
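To make this structure concrete, the sketch below shows one way such an item could be represented in code. The field names (question, task, options, answers, label, specialty) are illustrative assumptions for this article, not the dataset's published schema.

```python
# Illustrative sketch only: these field names are assumptions, not MediQAl's published schema.
from dataclasses import dataclass, field

@dataclass
class MediQAlItem:
    question: str                     # full question text, in French
    task: str                         # "single_answer", "multiple_answer", or "open_ended"
    options: list[str] = field(default_factory=list)   # empty for open-ended questions
    answers: list[str] = field(default_factory=list)   # correct option letter(s), or a short free-text answer
    label: str = "understanding"      # "understanding" (recall/application) vs "reasoning" (multi-step)
    specialty: str = ""               # one of the 41 medical specialties, e.g. "dermatology"

example = MediQAlItem(
    question="Parmi les propositions suivantes, laquelle est exacte ?",
    task="single_answer",
    options=["proposition 1", "proposition 2", "proposition 3", "proposition 4", "proposition 5"],
    answers=["B"],
    label="reasoning",
    specialty="bacteriology",
)
```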
How the dataset was built and checked
To assemble MediQAl, the author scraped training sites and official materials where students and teachers share past exam questions. Multiple‑choice questions were extracted automatically, while the less structured open‑ended questions required a mix of pattern matching and manual curation from web pages and PDFs. Questions with missing answers, images or tables, or very long free‑text answers were removed, along with near‑duplicates detected using similarity measures on both questions and solutions. To concentrate the hardest material in the test split, three smaller AI models were first asked to answer every question: any item that at least one of them solved was deemed too easy for testing and redirected to training or validation. A medical expert then reviewed a stratified sample of 150 questions, confirming that the vast majority were medically sound and appropriately framed, with only a small fraction flagged as outdated or ambiguous.
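The difficulty-based routing step can be pictured with a short sketch. The helper names below (small_models, answers_correctly) are hypothetical placeholders standing in for the three smaller models and their grading, not the paper's actual code.

```python
# Minimal sketch of the difficulty-based split described above.
# `small_models` and `answers_correctly` are hypothetical placeholders,
# not the study's actual models or code.

def route_question(item, small_models, answers_correctly):
    """Keep a question for the test split only if none of the smaller models solves it."""
    solved_by_any = any(answers_correctly(model, item) for model in small_models)
    return "train_or_validation" if solved_by_any else "test"
```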
Putting leading AI models to the test
With MediQAl in hand, the study evaluated 14 large language models, ranging from widely known commercial systems to open‑source models tuned for medicine or for step‑by‑step reasoning. All were tested in a “zero‑shot” setting, meaning they were simply prompted to answer without task‑specific coaching. The results show clear patterns. First, performance is consistently higher on simple recall questions than on reasoning‑heavy ones, across every model and task type. On average, accuracy on reasoning questions drops several points compared with understanding questions, with the gap especially large for open‑ended answers. Second, models explicitly trained to reason tend to outperform their “vanilla” counterparts, particularly on the hardest questions, but still fall far short of the reliability expected from practicing clinicians. Third, success varies widely by specialty: subjects like genetics, dermatology, or bacteriology are handled relatively well, while areas such as psychiatry, epidemiology, occupational medicine, and complex open cases remain challenging.
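As a rough illustration of what "zero‑shot" testing involves for the single‑answer questions, here is a minimal sketch that builds a plain French prompt and checks the answer letter. The prompt wording, the model_generate callable, and the letter‑matching score are assumptions for illustration; the sketch reuses the illustrative MediQAlItem record from earlier and is not the study's exact protocol.

```python
# Rough zero-shot evaluation sketch for single-answer questions.
# The prompt wording, `model_generate` callable, and letter-matching score
# are illustrative assumptions, not the study's exact protocol.

def zero_shot_prompt(item):
    """Build a plain French prompt listing the question and its five options."""
    lines = [item.question]
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Réponds uniquement par la lettre de la bonne réponse.")
    return "\n".join(lines)

def evaluate(model_generate, items):
    """Score accuracy; `model_generate` is any callable mapping a prompt string to the model's reply."""
    correct = 0
    for item in items:
        reply = model_generate(zero_shot_prompt(item)).strip().upper()
        gold_letters = {ans.strip().upper()[0] for ans in item.answers}  # e.g. {"B"}
        if reply and reply[0] in gold_letters:
            correct += 1
    return correct / len(items)
```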

What this means for patients and practitioners
MediQAl fills a major gap by offering a large, carefully curated benchmark that tests medical AI in French and across 41 specialties, using questions designed for future doctors rather than for machines. The findings show that, while top systems can often recall facts correctly and sometimes match exam‑style answers, they still struggle when asked to reason through nuanced clinical stories, especially outside English and in certain domains. For patients and healthcare providers, the message is clear: current AI tools can be helpful assistants but are not ready to replace human judgment, and their limits depend strongly on language and specialty. For researchers and regulators, MediQAl provides a public, reusable testbed to track progress in safe, equitable medical AI that works as well in French as in English.
Citation: Bazoge, A. MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation. Sci Data 13, 356 (2026). https://doi.org/10.1038/s41597-026-06680-y
Keywords: medical question answering, French language AI, clinical reasoning, large language models, medical exams