Clear Sky Science

Advancing conversational diagnostic AI with multimodal reasoning

Why smarter online health chats matter

Many of us now turn to online chats or apps when we feel unwell, sending photos of rashes, snapshots of test results or heart tracing readouts from a watch. Yet most health chatbots still work on text alone, ignoring this rich stream of pictures and documents. This paper explores a new kind of medical AI assistant that can see and talk, weaving images and records into a conversation much like a careful doctor would in a telehealth visit.

Figure 1. AI assistant that combines chat with medical photos and test images to support remote diagnosis.

A new kind of medical helper

The researchers built an updated version of a system called the Articulate Medical Intelligence Explorer, or AMIE. Instead of just reading and writing text, this new multimodal AMIE can receive skin photos, electrocardiogram images and clinical documents during a chat. It then reasons about all of this together with the patient’s written story. Under the hood, AMIE runs on a powerful general language and vision model but is wrapped in a framework that guides it through the typical stages of a medical visit: asking questions, identifying likely causes and suggesting next steps.

Guided conversations that adapt

Real doctors do not ask questions at random. They listen, build a mental picture of the patient and adjust their questions as new clues appear. To mimic this, the team designed what they call a state-aware dialogue framework. As the chat unfolds, AMIE maintains an internal summary of the patient’s history, symptoms and any uploaded images or documents. It also keeps a hidden list of possible diagnoses and knowledge gaps. This internal state helps AMIE decide when to keep asking about the history, when to request a photo or ECG, when it has enough information to outline likely causes and how to explain what it sees in the images.
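To make the idea of an internal state more concrete, here is a minimal Python sketch of how such a state might drive the choice of the next step in a chat. It is an illustration only, not the authors' implementation: the names DialogueState, Action and next_action, the "image:" prefix used to mark a gap that needs visual evidence, and the 0.7 confidence threshold are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Action(Enum):
    """Hypothetical next steps the assistant can take in the chat."""
    ASK_HISTORY = "ask_history"
    REQUEST_ARTIFACT = "request_artifact"   # e.g. a skin photo or an ECG
    OUTLINE_CAUSES = "outline_causes"


@dataclass
class DialogueState:
    """Illustrative internal state; all field names are assumptions."""
    summary: List[str] = field(default_factory=list)        # running patient summary
    artifacts: List[str] = field(default_factory=list)      # uploaded images or documents
    differential: Dict[str, float] = field(default_factory=dict)  # diagnosis -> confidence
    knowledge_gaps: List[str] = field(default_factory=list)  # open questions


def next_action(state: DialogueState, confidence_threshold: float = 0.7) -> Action:
    """Pick the next step from the current state using simple made-up rules."""
    # A gap tagged "image:" means visual evidence is still missing.
    if any(gap.startswith("image:") for gap in state.knowledge_gaps):
        return Action.REQUEST_ARTIFACT
    # Any other open gap means more history taking is needed.
    if state.knowledge_gaps:
        return Action.ASK_HISTORY
    # With no gaps left, explain likely causes only if one diagnosis is strong enough.
    top_confidence = max(state.differential.values(), default=0.0)
    if top_confidence >= confidence_threshold:
        return Action.OUTLINE_CAUSES
    return Action.ASK_HISTORY
```

In this toy version, a state holding the gap "image: photo of the rash" leads to a request for a photo, while a state with no open gaps and one diagnosis at 0.8 confidence moves on to outlining likely causes. In the system described in the paper these decisions are made by the underlying language and vision model rather than by fixed rules like these.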

Figure 2. Stepwise pipeline where mixed chat and medical images are processed into clearer diagnoses and care plans.

Putting AI and doctors to the test

To see how well multimodal AMIE performs, the team ran a large simulated telehealth exam resembling the practical tests used in medical schools. Trained actors played patients in 105 different scenarios that required both conversation and interpretation of visual material, such as skin images, heart tracings or lab reports. Each actor had two separate text-chat consultations, one with a board-certified primary care doctor and one with the AI system, without knowing which was which. Afterward, the human clinicians and the AI both filled out structured summaries of their diagnoses and plans. Independent specialist doctors and the patient actors then rated the quality of each consultation.

How the new system measured up

Across these scenarios, multimodal AMIE’s diagnoses were more often correct than those of the primary care doctors, whether looking only at the top choice or at a broader list of possibilities. Specialists also judged AMIE’s reasoning, use of images and handling of patient questions about those images to be as good as or better than the doctors’ on most measures. Notably, when the pictures were of lower quality, both AI and doctors did worse, but the AI’s accuracy dropped less. Patient actors rated the AI at least as highly as doctors for politeness, clarity, empathy and willingness to return for another visit, and they felt the AI did a better job of addressing and explaining what was seen in the uploaded images.
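The difference between judging only the top choice and judging a broader list of possibilities is usually called top-k accuracy. The short Python sketch below shows the arithmetic under simplifying assumptions: the function name top_k_accuracy is hypothetical, the three example cases are made up and are not data from the study, and a diagnosis counts as correct only on an exact text match, whereas in the study specialist doctors made that judgment.

```python
from typing import List, Sequence


def top_k_accuracy(ranked_lists: Sequence[List[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the first k ranked guesses."""
    if len(ranked_lists) != len(truths):
        raise ValueError("need one ranked list per true diagnosis")
    if not truths:
        raise ValueError("need at least one case")
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


# Made-up example cases, for illustration only.
ranked = [
    ["eczema", "psoriasis"],
    ["atrial fibrillation", "sinus tachycardia"],
    ["anemia", "hypothyroidism"],
]
truths = ["eczema", "sinus tachycardia", "gout"]

top1 = top_k_accuracy(ranked, truths, k=1)  # only the first guess counts
top2 = top_k_accuracy(ranked, truths, k=2)  # either of the first two guesses counts
```

Here one of the three cases is right on the first guess and two of the three are right within the first two guesses, so widening the list can only keep the score the same or raise it.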

Testing the inner workings

The authors also probed why the system works as it does. In computer-based simulations, they compared the full version of AMIE against a simpler version that lacked the structured, state-aware reasoning. The full system not only made more accurate diagnoses but also gathered information more effectively and produced more suitable care plans. When they removed the back-and-forth conversation and asked the model to work from images alone, performance clearly worsened, showing that dialogue and history taking still matter even for an AI that can see. Additional tests suggested that fine-tuning the underlying model only on narrow medical tasks might boost some skills but harm others, so the authors instead focused on careful design of the reasoning process layered on top.

What this could mean for future care

The study suggests that AI systems able to combine talk with sight could one day help clinicians handle complex remote consultations more safely and efficiently. By interpreting patient-supplied photos, heart tracings and documents within a thoughtful conversation, multimodal AMIE often matched or exceeded the performance of primary care doctors in this controlled setting while maintaining strong scores for empathy and communication. The authors stress that this is still exploratory work, not a real-world clinical trial, and much remains to be done to test safety, fairness and impact in everyday practice. Still, it points toward a future in which AI tools serve as capable partners in telehealth, helping both patients and clinicians make better use of the images and information already flowing through our screens.

Citation: Saab, K., Park, C., Strother, T. et al. Advancing conversational diagnostic AI with multimodal reasoning. Nat Med 32, 1726–1736 (2026). https://doi.org/10.1038/s41591-026-04371-0

Keywords: multimodal medical AI, telehealth, diagnostic conversation, clinical decision support, medical chatbots