Clear Sky Science

Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms

2026-04-02

Why this matters for everyday health care

Doctors rely on large clinical trials to decide which treatments are safe and effective. At the same time, new artificial intelligence tools are getting better at reading and summarizing medical research. This study asks a simple but important question for patients and clinicians alike: when these tools read the same trials as human experts, do they come to the same bottom-line conclusions about what works and what does not?

How the researchers tested the AI tools

The team focused on 20 well-known clinical trials published in the New England Journal of Medicine, covering heart disease, stroke, diabetes, cancer, and brain surgery. These trials were chosen because they were carefully designed and clearly reported, making them a strong testing ground. Instead of feeding the full articles to the AI systems, the researchers provided only the tables and figures that held the numbers, such as event rates and outcome charts. This forced the tools to lean on the data itself rather than simply copying the authors’ written summaries.

Figure 1. How different AI tools read the same medical trials and line up with doctors’ conclusions.

What the AI systems were asked to do

Four widely used large language models were tested: ChatGPT, Gemini, Grok3, and Claude. Each model received the same standardized prompt, asking it to interpret the data in five ways. The models had to explain the overall findings, make sense of the statistics, connect the results to patient care, point out study limitations, and suggest how the findings might be applied in practice. Two trained analysts then compared each AI answer to the original trial paper and scored performance in each of these five areas on a scale from zero to five.
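For readers who want to see the arithmetic, the scoring scheme described above can be sketched in a few lines of Python. This is only an illustration, not the authors' own scoring tool: the domain identifiers are paraphrases of the five tasks, the function name `total_score` is hypothetical, and the example scores are made up.

```python
# Illustrative sketch of a five-domain rubric, each domain scored 0 to 5.
# Domain names paraphrase the article; they are not taken from the paper.
DOMAINS = (
    "overall_findings",
    "statistics",
    "patient_care_relevance",
    "limitations",
    "practical_application",
)
MAX_PER_DOMAIN = 5


def total_score(domain_scores: dict[str, int]) -> int:
    """Sum one rater's five domain scores into a total out of 25."""
    if set(domain_scores) != set(DOMAINS):
        raise ValueError("expected exactly one score per domain")
    for name, score in domain_scores.items():
        if not 0 <= score <= MAX_PER_DOMAIN:
            raise ValueError(f"{name} score {score} is outside 0-{MAX_PER_DOMAIN}")
    return sum(domain_scores.values())


# Made-up scores for a single AI answer to a single trial.
example = dict(zip(DOMAINS, (5, 4, 5, 3, 4)))
print(total_score(example))  # 21
```

Five domains at up to five points each is what gives the 25-point ceiling used in the results. How the two analysts' scores were combined is not described here, so the sketch scores one rater at a time.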

How well the AI matched human conclusions

ChatGPT showed the strongest agreement with the published trial conclusions, earning a perfect median score of 25 out of 25 across the 20 trials. Gemini followed with a median of 21 out of 25, while Grok3 and Claude trailed with median scores of 18 and 17, respectively. All four tools performed best at describing why the results matter for patients, and ChatGPT in particular scored at the top in every domain. Gemini also did well at spotting study weaknesses and potential confounding factors, whereas Grok3 and Claude were less reliable in recognizing limitations and in giving practical treatment suggestions. The two human raters closely agreed with each other, suggesting that the scoring method itself was stable.
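A median is simply the middle value once a tool's per-trial totals are sorted, which makes it less sensitive to one unusually good or bad answer than an average would be. The short Python sketch below shows the calculation on invented numbers. The model labels, the five-trial lists, and the helper name `median_total` are all hypothetical and are not data from the study, which used 20 trials per tool.

```python
from statistics import median


def median_total(totals_per_trial: list[int]) -> float:
    """Return the median of one tool's per-trial totals (each out of 25)."""
    return median(totals_per_trial)


# Invented per-trial totals for two placeholder tools (NOT study data).
hypothetical_totals = {
    "model_a": [25, 24, 25, 25, 23],
    "model_b": [21, 19, 22, 21, 20],
}

medians = {name: median_total(scores) for name, scores in hypothetical_totals.items()}
for name, value in medians.items():
    print(f"{name}: median {value} out of 25")
```

Sorting the first list gives 23, 24, 25, 25, 25, so its middle value is 25. The second sorts to 19, 20, 21, 21, 22, with a middle value of 21.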

Figure 2. Step-by-step view of how AI turns trial numbers into judgments about treatments and their limits.

Caution about hidden training and real-world safety

Even though the numbers look impressive, the authors warn that the results should be interpreted with care. The trials they used are famous and likely appeared in the training data for these AI systems. That means the tools may already “know” these studies and could be recalling patterns they have seen before rather than reasoning independently from the supplied tables. The lack of blinding to which system produced each answer also leaves room for subtle human bias in scoring. In addition, the trials chosen mostly had clear, positive findings, which represent a best-case scenario rather than the messy and uncertain research that often shapes real-world decisions.

What this means for future care

For a layperson, the takeaway is that some AI tools, particularly ChatGPT and Gemini, can often read medical trial data and agree with expert conclusions, at least for well-known, high-quality studies. This suggests they may be useful helpers for summarizing complex research and organizing evidence, but they are not ready to replace doctors or researchers. Their training history is opaque, their performance varies across platforms, and their answers have not been proven safe for making direct treatment decisions. The authors argue that AI should be viewed as a powerful assistant that can sift through numbers and highlight patterns, while human clinicians remain responsible for judgment, empathy, and final choices about patient care.

Citation: Mao, G., Snyder, W., Chinthala, A.S. et al. Benchmarking agreement between large language models and published clinical trial conclusions across four artificial intelligence platforms. Sci Rep 16, 15606 (2026). https://doi.org/10.1038/s41598-026-45326-2

Keywords: large language models, clinical trials, medical AI, evidence synthesis, clinical decision support