Clear Sky Science

Comparative performance of recent and prior large language models and pediatric residents on pediatric in-training examination questions

2026-04-02

Why this matters for doctors and families

As artificial intelligence tools begin to appear in hospitals and medical schools, a key question is simple: can these systems really match the judgment of doctors in training, especially when children’s health is at stake? This study looks at how several leading AI language models perform on pediatric exam questions and what that might mean for future care and education.

Testing AI on real exam questions

The researchers gathered 498 questions from pediatric in-training exams taken at a large children’s hospital in Korea between 2016 and 2023. These exams are used to gauge how well residents are progressing during their four years of training. Most questions were multiple choice and covered a broad spread of specialties, from newborn care and infections to heart disease and critical care. About one in five questions included medical images, such as X-rays, scans, or clinical photographs, while the rest relied on written descriptions alone.

Figure 1. AI systems and pediatric residents are compared on written exam questions that test children’s health knowledge.

How the study compared people and machines

Six well-known AI language models were tested, representing three major families of systems and two generations for each family: earlier versions and more recent versions with vision abilities. The models were fed full exam booklets, not single questions, and had to work out for themselves which parts were the question text, which were answer choices, and which were images. Questions were originally written in Korean with English medical terms, and carefully checked translations were provided. Both residents and AIs were graded using the same rules, counting an answer as correct if it matched the official solution or an accepted synonym. To see how stable the systems were, each test set was run five times, and consistency across runs was calculated.
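The grading rule and the repeated-run check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the text normalization, and the agreement measure (the share of questions on which all runs gave the same answer) are assumptions, since the summary does not say exactly how consistency was calculated.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def is_correct(answer, official, synonyms=()):
    """True if the answer matches the official solution or an accepted synonym."""
    accepted = {normalize(official)} | {normalize(s) for s in synonyms}
    return normalize(answer) in accepted


def run_agreement(runs):
    """Share of questions on which every run gave the same answer.

    `runs` is a list of answer lists, one list per run, all the same length.
    This is one simple, hypothetical way to express run-to-run consistency.
    """
    n_questions = len(runs[0])
    unanimous = sum(
        1 for q in range(n_questions) if len({run[q] for run in runs}) == 1
    )
    return unanimous / n_questions


# Toy example: five runs over four multiple-choice questions (not study data).
toy_runs = [
    ["A", "C", "B", "D"],
    ["A", "C", "B", "D"],
    ["A", "C", "E", "D"],  # the third run disagrees on question 3
    ["A", "C", "B", "D"],
    ["A", "C", "B", "D"],
]
print(run_agreement(toy_runs))  # 0.75
print(is_correct("  Kawasaki Disease ", "Kawasaki disease"))  # True
```

Applying the same `is_correct` function to both residents and models mirrors the study's point that everyone was graded by the same rules.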

How well AI did against pediatric residents

Performance was summarized as the proportion of questions answered correctly. As expected, human scores rose with training level: first-year residents answered a little over half the questions correctly, while fourth-year residents reached about 70 percent. The newer AI models did even better overall, scoring around 78 percent across all questions and clearly beating the most senior residents. Earlier AI versions performed on par with senior residents. When the researchers focused only on text-based questions, recent models outscored fourth-year residents by roughly 10 percentage points. The AI systems were also very consistent from run to run, with nearly identical scores each time.
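To make the scoring arithmetic concrete, the short Python sketch below computes the proportion of correct answers and expresses a gap between two groups in percentage points. The helper names and the ten-question toy data are illustrative assumptions, not figures from the study.

```python
def accuracy(graded):
    """Proportion of questions answered correctly, given True/False flags."""
    return sum(graded) / len(graded)


def gap_in_points(acc_a, acc_b):
    """Difference between two accuracies, expressed in percentage points."""
    return (acc_a - acc_b) * 100


# Toy flags for ten questions (illustrative only, not study data).
model_flags = [True] * 8 + [False] * 2
resident_flags = [True] * 7 + [False] * 3

model_acc = accuracy(model_flags)
resident_acc = accuracy(resident_flags)
print(model_acc)                                       # 0.8
print(round(gap_in_points(model_acc, resident_acc), 1))  # 10.0
```

A gap of 10 percentage points means the raw scores differ by ten on a 0 to 100 scale; it is not the same as one group scoring 10 percent higher in relative terms.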

Figure 2. AI models handle text questions better than image-based ones when answering pediatric exam problems.

Where AI still struggles with pictures

The picture changed once medical images were involved. On questions that included images, none of the AI systems outperformed senior residents. Newer models did better than their predecessors, reaching accuracy in the mid-70 percent range on these visual items, but that still lagged behind their own strong performance on text-only questions. This pattern held across different types of images, including X-rays, scans, and clinical photos, and across a wide range of pediatric topics. The findings echo other research suggesting that, while language models are strong at reading and reasoning with text, their ability to interpret medical images, especially those of children, remains limited.

What this means for care and training

The authors argue that these results are encouraging for education but cautionary for direct clinical use. High and stable scores on written exam questions suggest that such systems could serve as useful study partners, giving pediatric trainees quick practice questions and explanations. However, success on multiple-choice tests does not guarantee safe performance on real patients, where information is messier, decisions are complex, and image interpretation is critical. In short, today’s multimodal AI tools can already rival senior residents on written pediatric exams, but they still fall short on image-heavy tasks and are not yet ready to replace human judgment in the clinic.

Citation: Kim, M.J., Park, J.S. & Kang, S.H. Comparative performance of recent and prior large language models and pediatric residents on pediatric in-training examination questions. Sci Rep 16, 15849 (2026). https://doi.org/10.1038/s41598-026-44333-7

Keywords: pediatrics, large language models, medical exams, clinical decision support, medical education