Clear Sky Science

Hallucination-aware learning and latency optimization transformer (HALL-OPT) for real-time edge intelligence

Why Faster, More Trustworthy AI Matters

Everyday devices are quietly becoming smarter, from factory sensors and hospital monitors to cars and home gadgets. Many of these systems rely on language models – the same kind of AI behind modern chatbots – to read instructions, answer questions, or summarise reports. But two problems get in the way: these models are slow and power-hungry, and they sometimes “hallucinate” convincing but false statements. This paper introduces HALL-OPT, a redesign of transformer-based language models that aims to make them both faster and more reliable so they can safely run on small, low-power edge devices rather than in distant data centres.

Figure 1.

The Challenge of Smart Devices at the Edge

Most high-performing language models live in the cloud, where they can use large amounts of computing power. That makes them hard to use in places where fast decisions are vital and network connections are imperfect or costly, such as autonomous vehicles, industrial robots, or bedside medical devices. When such systems send data to the cloud and wait for a reply, delays of even a few hundred milliseconds can be unacceptable. At the same time, lighter models that fit on edge devices often respond more quickly but are more likely to make up facts or misinterpret information. The study shows that this creates a trade-off: low hallucination usually comes with high delay, while low delay often means more hallucinations, leaving a gap for real-time, trustworthy edge intelligence.

A Unified Design Instead of Separate Fixes

Existing research typically treats reliability and efficiency as two separate goals. Some methods focus on catching hallucinations by checking answers against external databases or running multiple model passes, which adds extra time and energy use. Other methods shrink models with pruning, quantisation, or knowledge distillation, making them faster but sometimes less accurate and less trustworthy. HALL-OPT takes a different route: it weaves hallucination awareness directly into the model’s inner workings and uses that same information to decide what to compute and what to skip. Instead of bolting on extra checks or blindly trimming the network, it coordinates both reliability and speed in a single framework tailored for edge hardware.

How the System Filters Out Risky Content

At the heart of HALL-OPT is a hallucination-aware attention module that watches how the model spreads its focus across words and how confident it is in its predictions. When attention is scattered, confidence is low, or a token’s meaning clashes with the surrounding context, the token is given a higher “risk” score. A dual-stream detector then flags these risky pieces as potential hallucinations. The model uses these signals to drive a dynamic pruning stage: tokens that are both low value and high risk are removed, while important, trustworthy tokens are kept. This reduces the number of elements the model must process at each layer, cutting down the heavy, quadratic cost of attention without losing the core meaning of the text.
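To make the idea concrete, here is a minimal sketch of risk-guided token pruning. It is an illustration under stated assumptions, not the paper's implementation: the function names, the equal-weight blend of attention entropy and low confidence, and both thresholds are hypothetical, and the context-clash signal and the dual-stream detector are left out.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of one token's attention weights, scaled to [0, 1].

    Values near 1 mean attention is scattered; values near 0 mean it is focused.
    """
    n = len(weights)
    if n <= 1:
        return 0.0
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(n)

def risk_score(attn_row, confidence, alpha=0.5):
    """Hypothetical risk: blend of scattered attention and low confidence."""
    return alpha * attention_entropy(attn_row) + (1.0 - alpha) * (1.0 - confidence)

def prune_tokens(tokens, attn_rows, confidences, importances,
                 risk_threshold=0.6, importance_threshold=0.3):
    """Drop tokens that are both high risk and low importance; keep the rest."""
    kept = []
    for tok, row, conf, imp in zip(tokens, attn_rows, confidences, importances):
        risky = risk_score(row, conf) > risk_threshold
        unimportant = imp < importance_threshold
        if risky and unimportant:
            continue  # pruned: later layers never process this token
        kept.append(tok)
    return kept

# Toy input (made-up numbers, for illustration only).
tokens = ["valve", "maybe", "failed"]
attn_rows = [
    [0.90, 0.05, 0.05],   # focused attention
    [1/3, 1/3, 1/3],      # scattered attention
    [1/3, 1/3, 1/3],      # scattered attention
]
confidences = [0.95, 0.20, 0.30]
importances = [0.80, 0.10, 0.90]

kept = prune_tokens(tokens, attn_rows, confidences, importances)
print(kept)

# Attention compares every token with every other token, so its cost grows
# with the square of the sequence length. Keeping k of n tokens shrinks that
# cost by a factor of (k / n) ** 2.
cost_ratio = (len(kept) / len(tokens)) ** 2
```

In this toy input, "maybe" is removed because it is both risky and unimportant, while "failed" survives despite its high risk score because its importance is high. That mirrors the rule described above: only tokens that are both low value and high risk are dropped.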

Packing a Large Model into a Small, Efficient One

To fit powerful behaviour into a smaller package, HALL-OPT applies knowledge distillation, where a large “teacher” model trains a compact “student” model. Unlike standard distillation, the student is taught not only to match the teacher’s answers but also to mimic its sense of when outputs are likely to be wrong. Additional training nudges the student to avoid overconfident, hallucination-prone predictions. Finally, an edge optimisation layer prepares the model for low-precision arithmetic, turning its weights into 8-bit values and restructuring computation to match real edge devices such as NVIDIA Jetson boards and Google’s Coral TPU. This combination preserves most of the original accuracy while sharply reducing memory use, energy consumption, and response time.
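The two steps in this section can be sketched in a few lines. This is a simplified illustration rather than the method used in the paper: the squared-error risk term, the `risk_weight` and `temperature` values, and the symmetric per-tensor 8-bit scheme are all assumptions chosen for clarity, and the usual temperature-squared scaling of the distillation term is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature softens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's distribution q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits,
                      teacher_risk, student_risk,
                      temperature=2.0, risk_weight=0.5):
    """Hypothetical loss: match the teacher's answers AND its risk estimate."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    answer_term = kl_divergence(p_teacher, p_student)
    risk_term = (teacher_risk - student_risk) ** 2
    return answer_term + risk_weight * risk_term

def quantize_int8(weights):
    """Symmetric 8-bit quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from their 8-bit codes."""
    return [v * scale for v in q]

# Toy values (made up for illustration).
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.2, 0.2)
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], 0.2, 0.8)

weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

The loss is zero when the student reproduces both the teacher's output distribution and its risk estimate, and it grows when either diverges. The quantisation step stores each weight as a single signed byte plus one shared scale factor, so each restored weight differs from the original by at most about half of that scale.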

Figure 2.

Real-World Impact on Speed, Energy, and Safety

Tests on two demanding benchmarks – one for question answering that includes deliberately unanswerable questions, and another for news summarisation – show that HALL-OPT detects hallucinations with about 94% accuracy and keeps task performance close to that of a standard BERT model. At the same time, it cuts inference latency by roughly two-thirds and reduces energy use by around 40% or more when averaged across realistic workloads. On edge devices, it often responds in under 50 milliseconds and uses substantially less memory. Stress tests across many platforms and industrial-style scenarios, from smart factories to healthcare monitors, confirm that the system maintains predictable timing and a favourable “inferences per watt” rate, making it suitable for continuous, real-time use.

What This Means for Everyday AI

For non-specialists, the key message is that we do not have to choose between fast AI and trustworthy AI on small devices. By teaching the model to recognise its own weak spots and letting that awareness guide how much it computes, HALL-OPT delivers responses that are both quick and less likely to be fabricated. This makes it a promising backbone for future edge applications where faulty answers or slow reactions could have serious consequences, such as guiding a vehicle, controlling industrial machinery, or flagging critical changes in a patient’s condition.

Citation: Algawiaz, D. Hallucination-aware learning and latency optimization transformer (HALL-OPT) for real-time edge intelligence. Sci Rep 16, 12245 (2026). https://doi.org/10.1038/s41598-026-42981-3

Keywords: edge AI, hallucination detection, transformer models, real-time inference, energy-efficient computing