Clear Sky Science
Vision-Enabled AI scribes reduce omissions in clinical conversations: evidence from simulated medication histories
Smarter Digital Helpers in the Clinic
Anyone who has sat in a doctor’s office and watched a clinician type furiously into a computer has glimpsed a hidden burden in modern medicine: paperwork. New "AI scribes" promise to listen to visits and draft notes automatically, giving clinicians more time to talk with patients. But most of these tools only hear what is said; they cannot see what is shown. This study asks a simple question with big consequences for safety: what if an AI scribe could also see the medicine bottles on the table?

Why Seeing Matters as Much as Hearing
In real medical encounters, crucial information is often visual. Patients bring in boxes and bottles with small-print labels, demonstrate inhalers or injectors, or show allergy bands on their wrists. Subtle cues like appearance and posture can hint at how well someone is coping. Traditional AI scribes only process audio, so any detail that is never spoken aloud—such as the exact strength on a pill bottle—can be lost. When the goal is to build an accurate list of a person’s medicines, missing a dose or confusing two similar products could have serious consequences.
Glasses, Video, and a New Kind of AI Scribe
To close this gap, the researchers built a vision-enabled AI scribe that processes both sound and images. They paired Ray-Ban smart glasses, which record video and audio from the clinician’s point of view, with a state-of-the-art AI model that can interpret what it sees and hears together. Ten clinical pharmacists acted out 110 realistic medication-history conversations, each involving three to five medicines and real-world packaging. The team used 10 recordings to refine the prompts (the instructions that tell the AI exactly what to extract), then locked those settings and tested the system on the remaining 100 recordings.
How Well the AI Scribe Performed
For each conversation, human pharmacists prepared a careful reference list, including the patient’s name, birth date, allergies, each medicine’s name, strength and form, dosing schedule, reason for use, and any extra notes. The AI’s job was to generate the same structured summary from the video. Across 2,160 individual data points, the vision-enabled scribe was correct 98 percent of the time. It did slightly less well on basic patient details (96 percent) and slightly better on medication-related items such as dosing directions and indication (both 99 percent). Most of the 46 total mistakes were "commission" errors—recording something incorrectly—such as mixing up similar drug names or strengths. Only 10 were omissions, where the AI left a field blank even though the information was present.
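The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `score_field` and `score_record` helpers, the field names, the sample medicine, and the exact-match comparison are all assumptions made for the example. Only the totals at the end (2,160 data points, 46 errors) come from the study.

```python
from collections import Counter


def score_field(reference, generated):
    """Classify one field as 'correct', 'omission', or 'commission'.

    Assumption: a field counts as correct on an exact match after
    trimming whitespace and lowercasing. The study relied on human
    review, which this simple rule does not reproduce.
    """
    ref = (reference or "").strip().lower()
    gen = (generated or "").strip().lower()
    if ref == gen:
        return "correct"
    if ref and not gen:
        return "omission"    # information was present but left blank
    return "commission"      # something was recorded, but incorrectly


def score_record(reference, generated):
    """Tally outcomes over every field in the reference record."""
    return Counter(
        score_field(reference.get(key), generated.get(key))
        for key in reference
    )


# Hypothetical record, invented for illustration only.
reference = {
    "name": "Metformin",
    "strength": "500 mg",
    "form": "tablet",
    "directions": "one twice daily",
}
generated = {
    "name": "metformin",
    "strength": "",            # blank: an omission
    "form": "tablet",
    "directions": "one daily",  # wrong: a commission error
}
counts = score_record(reference, generated)
print(dict(counts))

# Totals reported in the study: 46 errors across 2,160 data points.
total_points = 2160
total_errors = 46
accuracy = (total_points - total_errors) / total_points
print(f"overall accuracy: {accuracy:.1%}")
```

With the reported totals, 2,114 of 2,160 data points are correct, which rounds to the 98 percent quoted above.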

Why Adding Vision Changed the Game
The team then asked how much the visual input actually helped by running the same 100 conversations through the AI using only the audio track. Accuracy dropped sharply to 81 percent. The biggest collapse was in documenting strength and form of medicines, which fell from 97 percent correct with video to just 28 percent with audio alone—a clear sign that label reading matters. Omissions exploded from 10 with video to 358 with audio-only, showing that much of the missing information simply was never spoken aloud. For many fields, especially medication names and dose details, having the AI "look" at the packaging dramatically reduced gaps and misunderstandings.
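A rough back-of-the-envelope check helps put these figures side by side. The sketch below assumes the audio-only run was scored on the same 2,160 data points as the video run (the summary implies this but does not state it) and treats the rounded 81 percent as exact, so the estimated audio-only error count is approximate rather than a reported result.

```python
# Figures reported in the summary.
total_points = 2160
video_errors = 46
video_omissions = 10
audio_accuracy = 0.81      # rounded value as reported
audio_omissions = 358

# Assumption: same denominator for both runs.
audio_errors_est = round(total_points * (1 - audio_accuracy))

# Share of all errors that were omissions in each condition.
omission_share_video = video_omissions / video_errors
omission_share_audio = audio_omissions / audio_errors_est

# How many times more omissions occurred without video.
omission_ratio = audio_omissions / video_omissions

print(f"estimated audio-only errors: {audio_errors_est}")
print(f"omission share, video: {omission_share_video:.0%}")
print(f"omission share, audio-only: {omission_share_audio:.0%}")
print(f"omission ratio (audio-only / video): {omission_ratio:.1f}x")
```

Under these assumptions, omissions make up a minority of errors when the AI can see the packaging but the large majority when it can only listen, which is consistent with the point that much of the missing information was never spoken aloud.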
What This Could Mean for Future Care
Although the results are impressive, the authors stress that this technology is not ready to replace human judgment. The study used simulated encounters in controlled settings with clear labels and good lighting, and the AI still made 46 errors that a clinician would need to catch. Real clinics are noisier, messier, and more varied. There are also important questions about privacy, consent, cost, and how being recorded affects what patients choose to share. Still, the work points to a future in which AI scribes that both see and hear could relieve some of medicine’s paperwork burden, capture more complete medication information, and help clinicians focus on what matters most: their patients.
Citation: Menz, B.D., Scarfo, N.L., Modi, N.D. et al. Vision-Enabled AI scribes reduce omissions in clinical conversations: evidence from simulated medication histories. npj Digit. Med. 9, 287 (2026). https://doi.org/10.1038/s41746-026-02494-9
Keywords: AI medical scribes, multimodal AI, medication history, clinical documentation, smart glasses