Clear Sky Science
Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education
Why smarter exam helpers matter
Future doctors around the world must pass tough licensing exams before they can treat patients. At the same time, powerful chatbots based on large language models are becoming common study partners. This article looks closely at how two such systems, DeepSeek and ChatGPT, handle the medical licensing exams used in the United States (USMLE) and China (CNMLE), and asks a simple question with big consequences: Can these tools really help educate safe, well‑prepared physicians—and if so, under what safeguards?

Two big exams, two powerful tools
The researchers focused on the USMLE and CNMLE, national exams that test a broad range of medical knowledge, from basic science to clinical decision‑making. They gathered hundreds of real questions: 243 from the USMLE sample exam and 300 from the CNMLE question bank, covering topics such as internal medicine, surgery, pediatrics, and psychiatry. Questions that required interpreting medical images were removed so that both tools faced only text‑based challenges. The team then queried one model from each system (GPT‑4o‑mini for ChatGPT and DeepSeek‑R1 for DeepSeek) in both English and Chinese, using simple instructions that mimicked how a real student might ask for help during exam preparation.
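For readers curious what such a query might look like in practice, here is a minimal sketch of sending one text‑only question to both services. It is not the authors' actual pipeline: the prompt wording, the ask helper, the example question, and the DeepSeek endpoint details are illustrative assumptions, and both calls assume OpenAI‑compatible chat APIs.

```python
# Minimal sketch of querying both models with a simple, student-style prompt.
# Everything here is illustrative: the prompt text, the helper name, and the
# example question are assumptions, not the study's actual protocol.
from openai import OpenAI

# Both services are assumed to expose OpenAI-compatible chat APIs.
chatgpt = OpenAI()  # reads OPENAI_API_KEY from the environment
deepseek = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

PROMPT = ("You are preparing for a medical licensing exam. "
          "Choose the single best answer and explain your reasoning.\n\n{question}")

def ask(client, model, question):
    """Send one text-only exam question and return the model's full reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content

# Made-up question stem; the real items came from the USMLE and CNMLE banks.
question = "A 54-year-old man presents with crushing chest pain. Which drug is given first?"
print(ask(chatgpt, "gpt-4o-mini", question))
print(ask(deepseek, "deepseek-reasoner", question))
```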
Who answered better, and how consistently?
To compare the tools fairly, the researchers ran each exam three times with each system, then measured how often the answers matched the official key. On the USMLE questions, DeepSeek answered correctly about 93% of the time, slightly ahead of ChatGPT at about 90%. DeepSeek held an even larger edge on the CNMLE, scoring about 87% versus ChatGPT’s 79%. DeepSeek outperformed ChatGPT across all three USMLE steps, including the most complex clinical decision‑making section, and across all four CNMLE units, especially in areas heavy in Chinese‑language basic science and clinical knowledge. The team also checked how stable the tools were across repeated runs, finding that both showed high consistency, with DeepSeek again slightly stronger.
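To make the scoring concrete, the sketch below computes the two quantities described above, accuracy against the official key and agreement across repeated runs, on a toy set of answers. The function names, the toy data, and the exact agreement measure are assumptions; the paper's statistical details may differ.

```python
# Illustrative scoring sketch: accuracy per run and answer agreement across runs.
# The data and metric definitions are assumptions, not the paper's exact statistics.

def accuracy(predictions, answer_key):
    """Share of questions where the model's choice matches the official key."""
    correct = sum(choice == answer_key[qid] for qid, choice in predictions.items())
    return correct / len(answer_key)

def consistency(runs):
    """Share of questions where all repeated runs picked the same option."""
    stable = sum(len({run[qid] for run in runs}) == 1 for qid in runs[0])
    return stable / len(runs[0])

# Toy example: three repeated runs over the same three questions.
answer_key = {"q1": "B", "q2": "D", "q3": "A"}
runs = [{"q1": "B", "q2": "D", "q3": "A"},
        {"q1": "B", "q2": "D", "q3": "C"},
        {"q1": "B", "q2": "D", "q3": "A"}]

for i, run in enumerate(runs, 1):
    print(f"run {i} accuracy: {accuracy(run, answer_key):.2f}")
print(f"consistency across runs: {consistency(runs):.2f}")
```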
Thinking out loud, but sometimes too slowly
Modern language models often show their reasoning step by step, much like a student writing out their logic. The researchers counted the number of characters in these explanations as a rough measure of how much “thinking” each system displayed. On the USMLE, the two tools were similar, providing reasoning of comparable length. On the CNMLE, however, DeepSeek produced notably longer explanations, suggesting deeper or more detailed chains of thought when dealing with complex Chinese medical questions. The trade‑off was speed: DeepSeek took longer to complete both exams, especially the CNMLE, while ChatGPT answered more quickly. In other words, DeepSeek tended to be more accurate and more verbose, whereas ChatGPT favored efficiency.
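The two efficiency measures are simple to reproduce in principle: count the characters of the explanation and time how long the reply takes. The sketch below illustrates this with a stand‑in function rather than a real API call; the helper names, the fake output, and the timing are purely illustrative.

```python
# Toy sketch of the two efficiency measures: explanation length in characters
# and time taken per reply. The stand-in model and its output are assumptions.
import time

def measure(ask_model, question):
    """Return (characters in the reply, seconds elapsed) for one question."""
    start = time.perf_counter()
    reply = ask_model(question)
    elapsed = time.perf_counter() - start
    return len(reply), elapsed

def fake_model(question):
    """Stand-in for a real API call so the sketch runs on its own."""
    time.sleep(0.1)
    return "Step 1: identify the key finding... Step 2: rule out distractors... Answer: B"

chars, seconds = measure(fake_model, "Which drug is given first?")
print(f"reasoning length: {chars} characters, time taken: {seconds:.2f} s")
```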

Promise, pitfalls, and a new safety net
Despite their strong scores—higher, on average, than many human test‑takers—both systems still made important mistakes. In some cases they chose plausible‑sounding but wrong treatments or misunderstood subtle concepts, a well‑known issue called “hallucination,” where the model confidently invents or misapplies facts. At the same time, they showed surprising strengths, such as spotting flawed exam questions that had no correct answer at all. Because medical education is closely tied to patient safety, the authors argue that these tools must be treated as helpers, not authorities. To support safer use, they propose a technical “fact‑checking loop” that links the model to a carefully built medical knowledge graph. When the model answers a question, its claims would be broken down, checked against trusted sources like guidelines and textbooks, and assigned confidence levels before being shown to learners.
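The sketch below outlines, in highly simplified form, what such a loop could look like: split an answer into claims, look each claim up in a trusted source, and attach a confidence level before anything reaches the learner. It is an illustration of the idea rather than the authors' proposed system; the claim splitter, the toy knowledge graph, and the support threshold are all assumptions.

```python
# Highly simplified outline of a fact-checking loop, not the authors' implementation.
# The claim splitter, the toy "knowledge graph", and the 0.7 threshold are assumptions.
from dataclasses import dataclass

@dataclass
class CheckedClaim:
    text: str
    supported: bool
    confidence: float  # 0.0 (unverified) to 1.0 (directly backed by a trusted source)

def split_into_claims(answer):
    """Crude claim splitter: treat each sentence as one checkable claim."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def check_claim(claim, knowledge_graph):
    """Look the claim up in a trusted source (a toy dict stands in for a real graph)."""
    confidence = knowledge_graph.get(claim, 0.0)
    return CheckedClaim(text=claim, supported=confidence >= 0.7, confidence=confidence)

def fact_check(answer, knowledge_graph):
    """Check every claim in a model's answer before it is shown to a learner."""
    return [check_claim(c, knowledge_graph) for c in split_into_claims(answer)]

# Toy run: one claim backed by the source, one unsupported (possibly hallucinated).
toy_graph = {"Aspirin is first-line therapy for suspected myocardial infarction": 0.95}
model_answer = ("Aspirin is first-line therapy for suspected myocardial infarction. "
                "Beta-blockers must always be avoided in these patients.")
for claim in fact_check(model_answer, toy_graph):
    flag = "OK" if claim.supported else "NEEDS REVIEW"
    print(f"[{flag}] ({claim.confidence:.2f}) {claim.text}")
```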
What this means for future medical training
For non‑experts, the message is both encouraging and cautious. DeepSeek and ChatGPT already perform at or above the level of many medical students on written exams, suggesting they can meaningfully support study, practice questions, and even the redesign of teaching around richer, step‑by‑step reasoning. Yet their errors—and the opacity of how they reach conclusions—mean they cannot replace human teachers or licensed clinicians. The authors envision a future in which such systems act as tightly supervised “assistant coaches” for medical learners, embedded in a framework that demands evidence, tracks reliability, and keeps human judgment firmly in charge. If built and governed carefully, these AI helpers could gradually shift medical education from simple memorization toward more interactive, generative learning—without losing sight of the ultimate goal: safer care for real patients.
Citation: Wang, Q., Li, J., Li, X. et al. Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education. Sci Rep 16, 13880 (2026). https://doi.org/10.1038/s41598-026-40043-2
Keywords: medical education AI, large language models, USMLE performance, Chinese medical licensing exam, fact-checking framework