Clear Sky Science
A large-scale benchmark for evaluating large language models on medical question answering in Romanian
Why this matters for health and technology
Many people now turn to online tools and chatbots for health information, but most of these systems work best in English and struggle with local medical records. This paper tackles that gap for Romania, where doctors write long, complex case summaries in Romanian and need reliable digital helpers to quickly answer questions about cancer patients. The authors present MedQARo, a new resource that lets researchers seriously test and improve large language models so they can better understand real clinical notes in Romanian.
A new question bank built from real patients
The heart of the study is MedQARo, a very large collection of 105,880 question–answer pairs linked to 1,242 cancer patients. Instead of translating English data, the team started from scratch with original Romanian case summaries, mostly for breast and lung cancer, plus several other tumor types. Seven oncology specialists and residents spent nearly 3,000 hours reading these documents and writing answers to carefully designed medical questions. Some questions are yes/no, others pull out specific details, and some require combining clues to infer stages or treatment timelines. All patient data was fully anonymized and approved by ethics committees.

Testing AI on home‑grown medical language
Using MedQARo, the authors evaluated several families of large language models, including two adapted for general Romanian text, one designed to handle very long inputs, and one trained on English medical material. They also compared them with two powerful commercial models accessed through paid APIs. Each model had to read the question and an excerpt of the clinical summary, then generate the answer. The researchers looked not only at exact matches, but also at how often the models captured the key words and how well they coped with flexible Romanian wording, using four different scoring measures.
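To make the idea of such scoring concrete, here is a minimal sketch of two standard question-answering measures, exact match and token-level F1. The paper's four measures may differ in detail, and the simple lowercase-and-split normalization below is an assumption, not the authors' exact recipe:

```python
def normalize(text: str) -> list[str]:
    # Lowercase and split into tokens; real evaluations may also strip
    # punctuation and diacritics (an assumption, not the paper's setup).
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> bool:
    # Full credit only when the normalized answers are identical.
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    # Partial credit: harmonic mean of token precision and recall,
    # so "carcinom ductal" still scores against "carcinom ductal invaziv".
    pred, ref = normalize(prediction), normalize(gold)
    ref_pool = list(ref)
    common = 0
    for tok in pred:
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

On this scale, the best system's score of about 0.67 means its answers overlapped with the reference answers roughly two-thirds of the time on average.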
Fine‑tuned models beat “out‑of‑the‑box” giants
Across the board, models that were used straight “out of the box” performed poorly on MedQARo, even when they were strong in English or had some Romanian exposure. Simple baselines that always guessed the most common answer sometimes did almost as well as these zero‑shot systems. Once the researchers fine‑tuned the models on the new dataset, however, performance jumped dramatically. The best system, a Romanian‑adapted model called RoMistral‑7B, reached an F1 score of about 0.67 on familiar cancer types and hospitals, clearly beating all other open‑source and commercial models. Yet even this leader still answered more than a third of questions incorrectly, showing how demanding the benchmark is.

Stress‑testing generalization across clinics and cancers
To see whether these systems could cope with new situations, the team built a tougher test set from a different medical center and from cancer types not seen during training. In this cross‑domain scenario, every model's performance dropped, often sharply, with the best fine‑tuned model answering correctly well under half the time. Models trained on English biomedical texts did not automatically transfer well to Romanian notes, and simply giving models a much longer slice of the clinical document did not help much either. In fact, focusing on the first part of the summary often worked better than feeding in the entire long record, suggesting that more context can confuse rather than clarify.
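The "first part of the summary" finding amounts to a simple preprocessing step: cutting the clinical record off at a fixed length before building the prompt. A minimal sketch of that idea follows; the word-based cutoff, the 512-word limit, and the prompt template are illustrative assumptions, not the authors' exact configuration:

```python
def truncate_context(summary: str, max_words: int = 512) -> str:
    # Keep only the opening portion of the clinical summary, where
    # key facts such as diagnosis and stage often appear first.
    words = summary.split()
    return " ".join(words[:max_words])

def build_prompt(question: str, summary: str, max_words: int = 512) -> str:
    # Hypothetical prompt layout: truncated context, then the question.
    context = truncate_context(summary, max_words)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"
```

The counterintuitive design choice here is that a shorter, front-loaded context can beat the full record: long tails of routine follow-up notes add tokens without adding answer-relevant facts.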
What this means for future clinical AI
For a lay reader, the take‑home message is that building safe and useful medical AI in languages like Romanian requires more than just plugging local data into a big English‑centric chatbot. Carefully crafted, language‑specific benchmarks such as MedQARo reveal both the potential and the limits of current systems. They show that small, open‑source models, when fine‑tuned on high‑quality local data, can outperform much larger general models running in the cloud. At the same time, the moderate scores, especially on new hospitals and cancers, warn that today’s tools are not ready to replace human judgment. Instead, MedQARo offers a solid foundation for the next generation of clinical assistants that can help Romanian doctors navigate complex cancer records while keeping patients’ safety and privacy at the center.
Citation: Rogoz, AC., Ionescu, R.T., Anghel, AV. et al. A large-scale benchmark for evaluating large language models on medical question answering in Romanian. npj Digit. Med. 9, 268 (2026). https://doi.org/10.1038/s41746-026-02465-0
Keywords: medical question answering, Romanian language AI, cancer clinical records, large language models, MedQARo benchmark