Clear Sky Science
MmodalFire: A Continuous Multimodal Dataset Comprising Video and Physical Sensing Data for Detecting Indoor Fires
Why better fire alarms matter
In large offices, data centers, or apartment buildings, a few minutes can make the difference between a small incident and a deadly blaze. Traditional ceiling‑mounted smoke or heat detectors often react only after smoke has risen or the room has warmed up, which can take precious time. Meanwhile, modern security cameras watch the same spaces continuously but can be fooled by fog, steam, or bright reflections. This article introduces a new kind of resource: a carefully designed dataset that combines both camera video and physical sensor readings, so that artificial intelligence systems can learn to spot indoor fires faster and more reliably than either method alone.
A new way to look at indoor fires
The authors present MmodalFire, a public dataset created specifically for indoor fire detection research. Instead of relying on video alone or on stand‑alone sensors alone, MmodalFire records both at the same time. Each experiment captures high‑definition video together with six kinds of sensor outputs, including smoke density, temperature, and radiation in several infrared and ultraviolet bands. Every short sequence is labeled simply as “fire” or “non‑fire,” allowing computer models to learn to tell dangerous events from harmless look‑alikes. By making this dataset freely available, the team aims to give researchers a common, realistic test bed for comparing fire detection algorithms.

How the experiments were built
To build MmodalFire, the researchers set up identical test rooms in two laboratories in China. Each room was about the size of a small office, with fixed walls and ceiling‑mounted detectors plus a camera in one corner for a full view. They carried out controlled burns of four common indoor materials: wood, cotton rope, polyurethane foam (like furniture stuffing), and n‑heptane (a clean‑burning liquid hydrocarbon fuel). To make sure detection models could also learn what isn’t a fire, they created two interference conditions: theatrical smoke made from dry ice and water mist from a household‑style humidifier. During every trial, the camera and sensors ran continuously, logging video frames and numerical readings with precise time stamps.
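Because the camera and the sensors log at their own rates, anyone using time‑stamped recordings like these has to pair each video frame with a sensor reading. The following is a minimal sketch of one common approach, nearest‑timestamp matching. The function names and the sampling rates are assumptions made for illustration, not details taken from the dataset.

```python
from bisect import bisect_left


def nearest_reading(sensor_times, frame_time):
    """Return the index of the sensor sample closest in time to frame_time.

    sensor_times must be sorted in ascending order (seconds).
    """
    if not sensor_times:
        raise ValueError("sensor_times is empty")
    i = bisect_left(sensor_times, frame_time)
    if i == 0:
        return 0
    if i == len(sensor_times):
        return len(sensor_times) - 1
    before, after = sensor_times[i - 1], sensor_times[i]
    # Ties go to the earlier sample.
    return i - 1 if frame_time - before <= after - frame_time else i


def align(frame_times, sensor_times):
    """Pair every video frame with the index of its nearest sensor sample."""
    return [nearest_reading(sensor_times, t) for t in frame_times]


# Hypothetical rates: sensors at 1 Hz, video frames at 4 fps.
sensor_times = [0.0, 1.0, 2.0]
frame_times = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]
pairs = align(frame_times, sensor_times)
```

Nearest‑timestamp matching is only one option; interpolating between neighboring sensor samples would be an alternative when the readings change quickly.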
Capturing real‑world variety
Real buildings differ in lighting, air movement, and how close a fire might be to each detector, so the team varied these factors deliberately. They adjusted wind from still air to gentle breezes, switched between bright and dim lighting, changed how much fuel was used, and moved the fire closer to or farther from sensors and walls. In some runs the fire produced dense black smoke and rapid heating; in others, such as the n‑heptane burns, the flames were clear with little smoke. For the negative cases, water mist and dry‑ice vapor looked very similar to smoke in the camera image but barely disturbed the physical sensors. Altogether they collected 65 video sequences (over 700 minutes of footage) with synchronized sensor data, then cut them into many overlapping five‑second clips, each of which could be used as a single training example.
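Cutting a long recording into overlapping five‑second clips can be pictured as a sliding window. The sketch below is illustrative only: the five‑second clip length comes from the article, while the one‑second stride and the function name are assumptions, since the article does not state how far apart consecutive clips start.

```python
def clip_windows(duration_s, clip_s=5.0, stride_s=1.0):
    """Return (start, end) times, in seconds, of overlapping clips.

    clip_s is the clip length; stride_s (an assumed value) is the gap
    between consecutive clip starts. Clips overlap when stride_s < clip_s.
    """
    if clip_s <= 0 or stride_s <= 0:
        raise ValueError("clip_s and stride_s must be positive")
    windows = []
    k = 0
    # Keep only clips that fit entirely inside the recording.
    while k * stride_s + clip_s <= duration_s:
        start = k * stride_s
        windows.append((start, start + clip_s))
        k += 1
    return windows


# An 8-second recording yields four clips with a 1-second stride.
windows = clip_windows(8.0)
```

A smaller stride produces more, and more similar, training examples from the same footage; a stride equal to the clip length would produce clips with no overlap at all.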

Teaching machines to combine senses
Using MmodalFire, the authors built and tested several computer models. Some models used only video, others only sensor readings, and the most advanced fused both. The video branch relied on a lightweight deep‑learning network tailored for motion and appearance in short clips. The sensor branch treated the six numerical streams as a small grid that changes over time and used modern techniques such as transformers to understand their patterns. A fusion module then brought these two streams together, allowing the model to “decide” how much weight to give each source under different conditions. When evaluated on separate test data, the combined model clearly outperformed either single‑source approach, especially in tricky situations such as smoke that had not yet reached the ceiling sensors or harmless vapor that looked like smoke in the camera.
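The idea of a model deciding how much weight to give each source can be illustrated with a simple gated blend of two feature vectors. This is a toy sketch, not the authors' fusion module, whose internals the article does not describe. The function names are hypothetical, and the gate scores are passed in by hand here, whereas a trained model would compute them from the inputs.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(video_feat, sensor_feat, video_score, sensor_score):
    """Blend two equal-length feature vectors using softmax gate weights.

    video_score and sensor_score are hypothetical gate scores; a higher
    score gives that source more influence on the fused vector.
    """
    if len(video_feat) != len(sensor_feat):
        raise ValueError("feature vectors must have the same length")
    w_video, w_sensor = softmax([video_score, sensor_score])
    fused = [w_video * v + w_sensor * s
             for v, s in zip(video_feat, sensor_feat)]
    return fused, (w_video, w_sensor)


# Equal scores: both sources contribute equally.
fused_equal, weights_equal = fuse([1.0, 0.0], [0.0, 1.0], 0.0, 0.0)

# Camera judged unreliable (say, blocked): the sensors dominate.
fused_blocked, weights_blocked = fuse([1.0, 0.0], [0.0, 1.0], -10.0, 10.0)
```

In the second call the video weight shrinks toward zero, which mirrors the behavior described above: when one source is misleading or unavailable, the fused result leans on the other.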
Robust alarms for complex spaces
The study concludes that carefully synchronized video and physical sensor data can make indoor fire alarms both faster and more trustworthy. By showing that a fused model can keep working even when the camera is blocked or when sensors react slowly, the work points toward smarter systems for critical facilities such as power plants, server rooms, and high‑occupancy buildings. MmodalFire gives researchers a shared, realistic dataset on which to design and compare such systems, opening the door to next‑generation alarms that use multiple “senses” to recognize real danger while staying quiet for everyday steam and stage smoke.
Citation: Jia, Y., Guo, Y., Chen, Y. et al. MmodalFire: A Continuous Multimodal Dataset Comprising Video and Physical Sensing Data for Detecting Indoor Fires. Sci Data 13, 489 (2026). https://doi.org/10.1038/s41597-026-06810-6
Keywords: indoor fire detection, multimodal sensors, video surveillance, fire safety dataset, deep learning alarms