Clear Sky Science

Performance of DeepSeek in the generation of in-training examination questions in radiology resident education

2026-03-24

Why smarter test questions matter

Medical imaging doctors in training take frequent tests to check what they know and how well they can care for patients. Writing these test questions takes a lot of expert time, and schools are wondering whether artificial intelligence tools can help. This study looked at whether a large language model called DeepSeek could share some of that work by writing multiple choice questions for radiology residents, and where human experts are still clearly needed.

Figure 1. AI helps radiology teachers create exam questions while doctors in training use them to learn more effectively.

What the researchers set out to learn

The team in China focused on a key part of radiology training: in‑training exams that track residents’ progress each year. They compared two matching sets of exam questions. One set was written by experienced radiologists following national training standards. The other set was generated in Chinese by the DeepSeek language model using carefully designed prompts that specified topic, trainee level, and type of question. All questions had to follow the same rules and were screened by a senior radiologist to make sure they were accurate and fair before being used.

How the exam experiment worked

From these question banks, the researchers randomly picked 14 AI questions and 14 expert questions and mixed them into one 28‑item online test. Forty radiology residents in their second or third year took this closed‑book exam. For each item, they chose an answer, guessed whether the question came from DeepSeek or a human expert, and rated it on difficulty, fit with the curriculum, overall quality, and how realistic the clinical story felt. This design let the team compare not only scores but also how the questions felt to learners.

Figure 2. AI-written and human-written questions compared for simple facts versus complex patient cases, showing where each works best.

Where AI matches human writers

Across the whole exam, residents got about the same proportion of questions right whether they were written by DeepSeek or by experts, and they were not very good at telling which source each item came from. For the simplest question type, aimed at basic facts and clear rules, DeepSeek’s items performed much like the human‑written ones. Objective measures used in testing, such as how well a question separates stronger and weaker students, also suggested that these basic knowledge items from AI were generally solid. This means AI could help build large banks of straightforward questions that reinforce core concepts, easing the workload on educators.
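For readers curious how a question can be scored on "how well it separates stronger and weaker students", the sketch below computes two standard testing measures: item difficulty (the share of test takers who answer correctly) and an upper-lower discrimination index (the share correct in the top-scoring group minus the share correct in the bottom-scoring group). It is only an illustration. The summary does not say which formulas the study used, and the function name `item_statistics`, the 27% group size, and the toy data are all assumptions.

```python
def item_statistics(responses, group_fraction=0.27):
    """Compute difficulty and discrimination for each exam item.

    responses: one list per test taker, holding 1 (correct) or 0 (incorrect)
    for each item. group_fraction is the share of test takers placed in the
    top and bottom groups (0.27 is a common convention, assumed here).
    """
    n_takers = len(responses)
    n_items = len(responses[0])

    # Rank test takers by total score, highest first.
    totals = [sum(r) for r in responses]
    order = sorted(range(n_takers), key=lambda s: totals[s], reverse=True)

    group_size = max(1, round(group_fraction * n_takers))
    upper = order[:group_size]
    lower = order[-group_size:]

    stats = []
    for i in range(n_items):
        # Difficulty: proportion of all test takers who got the item right.
        difficulty = sum(r[i] for r in responses) / n_takers
        # Discrimination: top-group proportion minus bottom-group proportion.
        p_upper = sum(responses[s][i] for s in upper) / group_size
        p_lower = sum(responses[s][i] for s in lower) / group_size
        stats.append({"difficulty": difficulty,
                      "discrimination": p_upper - p_lower})
    return stats


# Toy data (hypothetical): 4 test takers, 2 items.
toy = [
    [1, 1],  # strongest test taker
    [1, 1],
    [1, 0],
    [1, 0],  # weakest test taker
]
for index, item in enumerate(item_statistics(toy)):
    print(index, item)
```

In this toy data everyone answers the first item correctly, so it cannot separate strong from weak test takers, while the second item is answered correctly only by the higher scorers. A discrimination value near zero flags an item that tells examiners little, whatever its source.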

Where human judgment still leads

The picture changed when questions involved richer patient stories and harder decisions. For medium‑complexity questions with brief clinical scenes, residents answered AI and expert questions correctly at similar rates, but they rated the expert versions as more realistic and somewhat harder, especially among more senior residents who have more real‑world experience. For the most complex questions built around multi‑step case series and judgment calls, residents scored clearly higher on the expert‑written items than on DeepSeek’s versions. Trainees, particularly those in earlier years, seemed more likely to be misled or confused by the thinner, less authentic clinical situations created by the AI.

How people and AI can work together

The authors suggest using a tiered approach. DeepSeek and similar tools are well suited to drafting large numbers of basic, well‑structured questions that cover standard facts and definitions. Human experts, in turn, should remain in charge of questions that test how doctors think through uncertainty, weigh options, and apply values in real clinical settings. AI can also help reviewers spot weaker questions, while experts supply the nuanced understanding that comes only from caring for patients. With clear boundaries and careful oversight, combining AI with expert judgment could make medical exams both more efficient to build and better at measuring what truly matters.

Citation: Qian, W., Li, K., Cao, F. et al. Performance of DeepSeek in the generation of in-training examination questions in radiology resident education. npj Digit. Med. 9, 384 (2026). https://doi.org/10.1038/s41746-026-02568-8

Keywords: radiology education, exam questions, artificial intelligence, large language models, medical training