Clear Sky Science

Benchmarking quantum kernels and modern vision models for compound facial expression recognition

Why reading faces is harder than it looks

Many technologies now try to read our emotions from a simple webcam image, from mental health tools and driver-safety monitors to social robots and game testers. But real-life expressions are rarely just “happy” or “sad.” They are often blends—fear mixed with surprise, sadness tinged with disgust—that even people sometimes misread. This study asks a timely question: which modern computer systems, including emerging quantum-based methods, strike the best balance between accuracy and speed when decoding these subtle, mixed emotions from real-world faces?

Figure 1.

Blended emotions in everyday life

Instead of focusing on textbook basic emotions, the authors tackle compound expressions such as “fearfully surprised” or “sadly disgusted.” These nuanced states occur frequently in natural settings: in clinics, in cars, and during interactions with social robots. The team uses a well-known image collection called RAF-DB, which contains thousands of faces captured “in the wild” under varied lighting and poses and across varied demographics. They restrict attention to 11 compound categories and enforce identical data splits and preprocessing across all methods, so that any differences in performance come from the models themselves, not from cherry-picked training conditions.
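One simple way to guarantee identical data splits is to compute the split once from a fixed random seed and reuse the resulting indices for every model. The sketch below is only an illustration of that idea, not the authors' procedure: the function name `stratified_split`, the 20% test fraction, the seed, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)  # local RNG, so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):  # fixed class order keeps the result stable
        indices = by_class[label]
        rng.shuffle(indices)
        # At least one test image per class (a class with a single image
        # would therefore end up with no training image).
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels standing in for compound-expression classes.
labels = ["fearfully_surprised"] * 10 + ["sadly_disgusted"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
```

Because the indices depend only on the labels and the seed, every pipeline can load exactly the same training and test images.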

Seven ways to teach computers to read faces

The study compares seven pipelines that represent three generations of technology. First come classical hybrids, which use established convolutional networks (ResNet50 and VGGFace) only as feature extractors, then hand off the final decision to a simpler margin-based classifier called an SVM. Second are two popular modern deep models: EfficientNetV2-S, a streamlined convolutional network tuned for efficiency, and ViT-B/16, a vision transformer that analyzes images as a set of patches and uses global attention to connect distant facial regions. Third are three quantum–classical hybrids. In these, a standard visual encoder produces compact numerical features that are then processed by quantum-inspired components: a quantum support vector machine (QSVM), a quantum k-nearest neighbor method (QKNN), or a quantum convolutional network (QCNN).
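To give a feel for what a quantum kernel computes, the sketch below simulates one of the simplest possible cases on an ordinary computer. It assumes each feature rotates a single qubit (a product-state angle encoding), in which case the kernel value, the squared overlap between two encoded states, has a closed form. This summary does not say which feature map or circuit the authors used, so the encoding, the function names, and the sample features here are illustrative assumptions only.

```python
import math


def fidelity_kernel(x, y):
    """Squared state overlap for an assumed product-state angle encoding.

    Each feature x_i is assumed to rotate one qubit to
    cos(x_i/2)|0> + sin(x_i/2)|1>, so the overlap of two encoded
    states is the product of cos((x_i - y_i)/2) over all qubits.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    overlap = 1.0
    for xi, yi in zip(x, y):
        overlap *= math.cos((xi - yi) / 2.0)
    return overlap ** 2


def gram_matrix(samples):
    """Kernel matrix that a kernel classifier such as an SVM would consume."""
    return [[fidelity_kernel(a, b) for b in samples] for a in samples]


# Hypothetical 2-dimensional features, already scaled to angles in radians.
features = [[0.0, 0.0], [math.pi, 0.0], [math.pi / 2, 0.0]]
K = gram_matrix(features)
```

Identical inputs give a kernel value of 1, while inputs whose first angle differs by π give a value of essentially 0. A kernel classifier uses these similarities in place of a classical kernel.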

Speed, accuracy, and the trade-offs between them

Rather than chasing a single headline accuracy number, the authors carefully measure feature-extraction time, training time, and per-image classification time, all on the same hardware. ViT-B/16 comes out on top for accuracy, correctly classifying about 63% of compound expressions while keeping feature extraction surprisingly fast. EfficientNetV2-S is close behind at about 61% accuracy, but needs far more time to extract features. Among the quantum hybrids, QSVM performs best, reaching roughly 55% accuracy with only about a minute of feature-extraction time, making it attractive when computing budgets are limited. QKNN and QCNN are even more frugal with time—especially QCNN—but sacrifice accuracy, hovering in the mid-30% range. Classical hybrids sit in the middle, useful as transparent baselines but generally trailing the modern and quantum-enhanced options.
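Timing each stage separately can be sketched with a small helper built on Python's monotonic high-resolution clock. The three stage functions below are toy stand-ins invented for illustration, not the models from the study. Only the measurement pattern is the point.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the three stages of a pipeline.
def extract_features(images):
    return [[sum(img) / len(img)] for img in images]


def train(features, labels):
    # Toy "model": the mean feature value per class.
    sums, counts = {}, {}
    for (value,), label in zip(features, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(model, feature):
    # Predict the class whose mean is closest to the feature value.
    return min(model, key=lambda label: abs(model[label] - feature[0]))


images = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
labels = ["a", "b", "a", "b"]

features, t_extract = timed(extract_features, images)
model, t_train = timed(train, features, labels)
predictions, t_classify = timed(lambda: [classify(model, f) for f in features])
per_image_seconds = t_classify / len(images)
```

Reporting the three numbers separately shows whether a pipeline's cost lies in feature extraction, training, or per-image classification.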

Where machines still get confused

A closer look at the errors shows that all systems struggle in similar ways. Confusions tend to cluster along two families: fear versus surprise, and sadness versus disgust (sometimes mixed with anger). These categories share similar facial muscle patterns—wide eyes and raised brows for fear and surprise, or downturned lips and nose wrinkles for sadness and disgust—so their visual footprints overlap. Even ViT’s global attention and QSVM’s more expressive quantum kernels cannot completely separate these lookalike expressions. The authors argue that future models should pay targeted attention to specific facial regions linked to action units (such as eye corners, brows, and the area around the nose), adjust their training objectives to widen margins between neighboring classes, and use balanced data augmentation strategies to avoid overfitting to the most common compounds.
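The kind of error analysis described above can be approximated by counting which (true label, predicted label) pairs occur most often among the mistakes. The sketch below is a minimal version; the function name `top_confusions` and the sample predictions are made up purely to show the output format and are not results from the study.

```python
from collections import Counter


def top_confusions(true_labels, predicted_labels, k=3):
    """Return the k most frequent (true, predicted) error pairs with their counts."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(k)


# Hypothetical predictions, invented purely to show the output format.
y_true = ["fear", "fear", "fear", "surprise", "sad", "sad", "disgust"]
y_pred = ["surprise", "surprise", "fear", "fear", "disgust", "sad", "disgust"]
confusions = top_confusions(y_true, y_pred)
```

In this toy data the most frequent error is fear predicted as surprise, the kind of pattern the authors report along the fear/surprise and sadness/disgust families.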

Figure 2.

What this means for real-world emotion-aware systems

The authors do not claim that quantum methods have already surpassed classical deep learning. Instead, they provide a careful map of the current landscape. If absolute accuracy is paramount and computing resources are plentiful, vision transformers still lead. When developers must watch power budgets or latency, for example on edge devices or servers with strict response-time targets, quantum hybrids like QSVM and QKNN offer a promising middle ground, trimming feature-extraction and inference time while maintaining respectable accuracy. Classical CNN-plus-SVM pipelines remain useful yardsticks. By combining rigorous compute accounting, detailed error analysis, and formal statistical tests, this work shows that reading complex human emotions is as much about smart resource allocation and fairness as it is about raw accuracy, and that quantum-inspired tools may soon become practical partners in that effort.

Citation: Florestiyanto, M.Y., Surjono, H.D. & Jati, H. Benchmarking quantum kernels and modern vision models for compound facial expression recognition. Sci Rep 16, 11261 (2026). https://doi.org/10.1038/s41598-026-41514-2

Keywords: facial expression recognition, compound emotions, vision transformers, quantum machine learning, efficient AI models