Performance of DeepSeek in the generation of in-training examination questions in radiology resident education
Why smarter test questions matter
Radiology residents, the doctors training to specialize in medical imaging, take frequent tests to check what they know and how well they can care for patients. Writing these test questions takes a lot of expert time, and training programs are wondering whether artificial intelligence tools can help. This study looked at whether a large language model called DeepSeek could take on some of that work by writing multiple-choice questions for radiology residents, and where human experts are still clearly needed.

What the researchers set out to learn
The team in China focused on a key part of radiology training: in‑training exams that track residents’ progress each year. They compared two matching sets of exam questions. One set was written by experienced radiologists following national training standards. The other set was generated in Chinese by the DeepSeek language model using carefully designed prompts that specified topic, trainee level, and type of question. All questions had to follow the same rules and were screened by a senior radiologist to make sure they were accurate and fair before being used.
How the exam experiment worked
From these question banks, the researchers randomly picked 14 AI questions and 14 expert questions and mixed them into one 28‑item online test. Forty radiology residents in their second or third year took this closed‑book exam. For each item, they chose an answer, guessed whether the question came from DeepSeek or a human expert, and rated it on difficulty, fit with the curriculum, overall quality, and how realistic the clinical story felt. This design let the team compare not only scores but also how the questions felt to learners.
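The sampling and mixing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bank sizes, the question labels, and the `assemble_exam` helper are hypothetical and are not taken from the study.

```python
import random

def assemble_exam(ai_bank, expert_bank, per_source=14, seed=0):
    """Draw per_source items from each bank and shuffle them into one test.

    Each item keeps a hidden source label so that results can later be
    compared by source. The seed only makes this sketch reproducible.
    """
    rng = random.Random(seed)
    picked = [("ai", q) for q in rng.sample(ai_bank, per_source)]
    picked += [("expert", q) for q in rng.sample(expert_bank, per_source)]
    rng.shuffle(picked)  # mix the sources so item order gives no hint
    return picked

# Hypothetical banks of 30 questions each (the real bank sizes are not given here).
ai_bank = [f"AI-{i}" for i in range(1, 31)]
expert_bank = [f"EX-{i}" for i in range(1, 31)]

exam = assemble_exam(ai_bank, expert_bank)
```

Keeping the source label alongside each item, but hidden from test takers, is what allows both the score comparison and the "guess the source" comparison afterward.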

Where AI matches human writers
Across the whole exam, residents got about the same proportion of questions right whether they were written by DeepSeek or by experts, and they were not very good at telling which source each item came from. For the simplest question type, aimed at basic facts and clear rules, DeepSeek’s items performed much like the human‑written ones. Objective measures used in testing, such as how well a question separates stronger and weaker students, also suggested that these basic knowledge items from AI were generally solid. This suggests that AI could help build large banks of straightforward questions that reinforce core concepts, easing the workload on educators.
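To make "how well a question separates stronger and weaker students" concrete, here is a minimal Python sketch of two classical item statistics: difficulty (the share of test takers who answer correctly) and an upper-lower discrimination index. The formula choice, the 27% group size, the `item_stats` helper, and the toy data are all assumptions for illustration; the summary does not say exactly which measures the authors computed.

```python
def item_stats(responses, item, group_fraction=0.27):
    """Return (difficulty, discrimination) for one item.

    responses: one list per test taker, holding 1 (correct) or 0 (wrong)
    for every item. Difficulty is the proportion answering correctly.
    Discrimination is the proportion correct in the top-scoring group
    minus the proportion correct in the bottom-scoring group.
    """
    n = len(responses)
    ranked = sorted(responses, key=sum, reverse=True)  # best total score first
    k = max(1, round(n * group_fraction))              # size of each group
    upper, lower = ranked[:k], ranked[-k:]
    difficulty = sum(r[item] for r in responses) / n
    discrimination = (sum(r[item] for r in upper)
                      - sum(r[item] for r in lower)) / k
    return difficulty, discrimination

# Made-up scores for six test takers on three items (not study data).
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

difficulty_0, discrimination_0 = item_stats(toy_responses, item=0)
difficulty_1, discrimination_1 = item_stats(toy_responses, item=1)
```

A discrimination value near 1 means strong test takers get the item right and weak ones do not. A value near 0, or a negative value, flags an item that may be confusing or poorly written.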
Where human judgment still leads
The picture changed when questions involved richer patient stories and harder decisions. For medium‑complexity questions with brief clinical scenes, residents answered AI and expert questions correctly at similar rates, but they rated the expert versions as more realistic and somewhat harder, especially among more senior residents who have more real‑world experience. For the most complex questions built around multi‑step case series and judgment calls, residents scored clearly higher on the expert‑written items than on DeepSeek’s versions. Trainees, particularly those in earlier years, seemed more likely to be misled or confused by the thinner, less authentic clinical situations created by the AI.
How people and AI can work together
The authors suggest using a tiered approach. DeepSeek and similar tools are well suited to drafting large numbers of basic, well‑structured questions that cover standard facts and definitions. Human experts, in turn, should remain in charge of questions that test how doctors think through uncertainty, weigh options, and apply values in real clinical settings. AI can also help reviewers spot weaker questions, while experts supply the nuanced understanding that comes only from caring for patients. With clear boundaries and careful oversight, combining AI with expert judgment could make medical exams both more efficient to build and better at measuring what truly matters.
Citation: Qian, W., Li, K., Cao, F. et al. Performance of DeepSeek in the generation of in-training examination questions in radiology resident education. npj Digit. Med. 9, 384 (2026). https://doi.org/10.1038/s41746-026-02568-8
Keywords: radiology education, exam questions, artificial intelligence, large language models, medical training