Clear Sky Science
Towards accurate and interpretable competency-based assessment: enhancing clinical competency assessment through multimodal AI and anomaly detection
Why Smarter Training for Doctors Matters
When doctors train to handle medical emergencies, their performance is often judged by human examiners who watch them work in simulated scenarios. These ratings are vital for patient safety, yet they can be subjective, uneven across examiners, and too coarse to show trainees exactly what to improve. This study introduces a new artificial intelligence (AI) system that watches and listens during high‑fidelity anesthesia simulations and converts what it sees and hears into an objective, interpretable measure of clinical competence. The goal is not to replace expert teachers, but to give them a more precise, fair, and detailed lens on how residents actually behave under pressure.

Watching Emergencies from Many Angles
The researchers focused on critical care simulations used to prepare anesthesia residents in Israel for their national board exam. Ninety residents from 17 hospitals managed life‑threatening crises in a realistic operating‑room setting with a full‑body mannequin, a nurse, and an intern. Each scenario unfolded in four stages: an initial stable period, a phase of rapid deterioration, active resuscitation using standard life‑support protocols, and finally stabilization and handoff. Throughout, cameras recorded the room and the patient monitor, microphones captured speech, and the vital‑sign display itself was digitized. Board‑certified anesthesiologists then gave each resident an overall performance rank from 1 (poor) to 5 (exemplary).
Turning Behavior into Data Streams
To make this rich scene analyzable by AI, the team transformed the videos and audio into synchronized time‑series signals. One stream tracked when a resident’s gaze fell on the patient monitor, using face detection and gaze‑target estimation. A second estimated where the resident stood and moved in the room, based on three‑dimensional body pose. A third marked when the resident spoke, after cleaning the audio to isolate their voice from background noise. Finally, the researchers read the changing heart rate, blood pressure, breathing rate, and oxygen saturation directly from the monitor screen using optical character recognition, producing continuous curves of physiological status. All of these channels were aligned frame‑by‑frame, yielding a detailed, moment‑to‑moment portrait of how residents looked, moved, spoke, and responded to the patient’s condition.
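A minimal sketch of this alignment step is below, assuming illustrative per-channel signals and a zero-order-hold resampler; the gaze, pose, speech, and OCR extractors themselves, the 25 fps frame rate, and the helper names are stand-ins for illustration, not the authors' pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

fps = 25.0                                     # assumed video frame rate
frame_times = np.arange(0.0, 60.0, 1.0 / fps)  # one minute of simulation

def resample_to_frames(values, src_times, frame_times):
    """Zero-order hold: carry the most recent reading forward to each frame."""
    idx = np.searchsorted(src_times, frame_times, side="right") - 1
    idx = np.clip(idx, 0, len(values) - 1)
    return np.asarray(values)[idx]

# Illustrative per-frame behavioral channels (in practice these come from
# gaze-target estimation, 3-D body pose, and voice-activity detection).
gaze_on_monitor = rng.random(len(frame_times)) > 0.5   # binary, per frame
speaking = rng.random(len(frame_times)) > 0.7          # binary, per frame
position = np.cumsum(rng.normal(scale=0.01, size=(len(frame_times), 2)), axis=0)

# Vital signs are read off the monitor by OCR roughly once per second,
# then aligned to the video frames.
vital_times = np.arange(0.0, 60.0, 1.0)
heart_rate = 80 + 10 * np.sin(vital_times / 10.0)
hr_per_frame = resample_to_frames(heart_rate, vital_times, frame_times)

# One synchronized multivariate time series: (frames, channels).
X = np.column_stack([gaze_on_monitor, speaking, position, hr_per_frame])
print(X.shape)  # (1500, 5)
```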

Learning What “Expert‑Like” Looks Like
Instead of teaching the AI to copy human scores directly, the authors used an anomaly‑detection model called MEMTO, originally designed for spotting unusual patterns in complex time series. First, they trained MEMTO only on the best performances—residents ranked 5—to learn what “ideal” behavior over time looks like across all signals. Once this baseline was in place, the model processed every resident’s simulation and produced an anomaly score at each moment, reflecting how far that instant’s behavior departed from the expert pattern. These anomaly scores were then aggregated and smoothly mapped onto the familiar 1–5 scale, so that lower deviations from the expert template yielded higher competency scores.
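MEMTO itself is a memory-guided transformer with its own training recipe; the train-on-experts logic and the score-to-scale mapping can be illustrated with a much simpler reconstruction-based detector. Everything below (the window features, the PCA subspace standing in for MEMTO, and the exponential 1–5 mapping) is an assumed sketch, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature windows: rows are short time windows, columns are channel features.
X_expert = rng.normal(size=(500, 16))   # windows from rank-5 residents only
X_test = rng.normal(size=(100, 16))     # windows from one assessed resident

# Fit a low-rank "expert" subspace on the rank-5 data.
mu = X_expert.mean(axis=0)
_, _, Vt = np.linalg.svd(X_expert - mu, full_matrices=False)
V = Vt[:8].T                            # keep the top 8 components

def anomaly_scores(X):
    """Reconstruction error: how far each window sits from the expert subspace."""
    X_hat = (X - mu) @ V @ V.T + mu
    return np.linalg.norm(X - X_hat, axis=1)

# Aggregate the per-moment deviations for one resident, then map them with a
# smooth, monotone-decreasing curve onto the 1-5 scale: zero deviation -> 5,
# large deviation -> toward 1. The exponential form is an assumed choice.
scale = anomaly_scores(X_expert).mean()
deviation = anomaly_scores(X_test).mean()
competency = 1.0 + 4.0 * np.exp(-deviation / scale)
print(f"competency score: {competency:.2f}")
```

The design choice that matters mirrors the paper's: the model is fit only on rank-5 performances, so the anomaly score measures distance from expert behavior, and the mapping is monotone decreasing so smaller deviations yield higher competency scores.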
What the AI Learned About Good Performance
The multimodal approach, combining gaze, movement, speech, and vital signs, proved crucial. When trained on top‑rank residents, the model’s scores aligned closely with expert ratings, with strong correlations and consistency measures, and it sorted residents in nearly the same order as human examiners. In contrast, relying on a single stream, such as gaze alone, produced much weaker agreement. Training the model on the worst performances also led to poorer alignment, underscoring that benchmarks should be anchored in expert behavior rather than common mistakes. To make the system’s decisions understandable, the team used an explanation method known as SHAP, which highlights which inputs most influenced the anomaly scores. Speech activity and visual attention to the monitor emerged as especially important, particularly during crisis escalation and active resuscitation, while vital signs became more influential during stabilization.
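As a sketch of how such attributions can be computed, the snippet below runs SHAP's model-agnostic KernelExplainer over per-channel summary features. The feature set, channel weights, and stand-in scoring function are assumptions for illustration; the paper applies SHAP to MEMTO's anomaly scores, not to this toy function:

```python
import numpy as np
import shap

rng = np.random.default_rng(0)
channels = ["gaze_on_monitor", "speech", "movement", "vital_signs"]

def anomaly_fn(X):
    """Stand-in anomaly score: weighted deviation from the expert mean (zero)."""
    weights = np.array([0.4, 0.4, 0.1, 0.1])   # assumed channel weights
    return np.abs(X) @ weights

background = rng.normal(size=(50, 4))          # expert-phase channel summaries
explainer = shap.KernelExplainer(anomaly_fn, background)
shap_values = explainer.shap_values(rng.normal(size=(5, 4)))

# Channels with the largest mean |SHAP| contributed most to the deviation
# in this phase of the scenario.
for name, value in zip(channels, np.abs(shap_values).mean(axis=0)):
    print(f"{name}: {value:.3f}")
```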
What This Means for Future Medical Training
This work shows that AI can move clinical training beyond simple checklists or pass–fail ratings by capturing how trainees actually behave second‑by‑second in realistic emergencies. By comparing each resident to a data‑driven portrait of expert performance, the system can flag when communication falters, attention to the monitor lapses, or responses to changing vital signs are off‑pattern—information that can guide richer, phase‑specific feedback in debriefing sessions. The authors emphasize that such tools should augment, not replace, human judgment, and must be deployed carefully, with strong privacy protections and fairness checks. Still, their results suggest a path toward more objective, transparent, and educationally useful assessments that can scale across training programs and, ultimately, help make real‑world patient care safer.
Citation: Gershov, S., Mahameed, F., Raz, A. et al. Towards accurate and interpretable competency-based assessment: enhancing clinical competency assessment through multimodal AI and anomaly detection. npj Digit. Med. 8, 219 (2025). https://doi.org/10.1038/s41746-025-02299-2
Keywords: clinical competency assessment, medical simulation, multimodal AI, anomaly detection, medical education