Clear Sky Science · en

Infrared-visible image fusion with double-attention mechanism and adaptive interaction loss

Seeing More Than One Camera Can

Imagine driving on a foggy night when your eyes and a thermal camera each catch different parts of the scene. One shows bright heat from people and cars, the other reveals lane markings, buildings, and signs. This study presents a new way to blend those two views into a single, clearer picture that can help humans and machines see better in complex outdoor environments.

Figure 1. Merging heat-based and normal camera views into one clearer scene for better outdoor vision.

Why Two Kinds of Images Matter

Visible-light cameras capture the world much like our eyes do, with sharp detail and rich textures. Infrared cameras capture heat, so glowing shapes reveal warm engines, people, or animals even in darkness, fog, or glare. Each view is incomplete on its own. Visible images can lose important objects in bad weather or low light, while infrared images often look blurry and lack fine detail. Combining them into one image that keeps both sharp textures and bright heat signals is valuable for tasks like surveillance, remote sensing, and self-driving cars.

The Challenge of Blending Different Views

For years, researchers have built computer programs that learn how to fuse infrared and visible images. Many modern methods use deep learning, where the computer figures out which features to keep and how to mix them. A popular idea is attention, which lets the network focus on the most important parts of an image. But earlier systems either focused only on each image alone or mixed them without enough control. That meant important details from one camera could drown out unique signals from the other, or the final image could become dull and less informative.

Paying Attention in Two Directions

The authors propose a new fusion model built around a double-attention idea. First, the network looks within each image separately to understand its own patterns and structures, like edges, textures, and hot objects. Then it performs cross-attention, where infrared and visible views interact and guide each other, so that matching regions can share useful information. These steps are handled with a modern building block called a Swin Transformer, which breaks images into small patches and relates them through attention computed inside shifting windows, so that links between distant regions can build up across layers. After this two-step extraction, another attention block mixes the combined features into a single representation, which is turned back into an image.
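The two-step idea described above (attention within each image, then attention across the two images) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' network: it omits the learned projections, multiple heads, and shifted windows that a Swin Transformer uses, and the names `attention` and `double_attention` as well as the toy token sizes are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention over patch tokens of shape (n, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def double_attention(ir_tokens, vis_tokens):
    # Step 1 (assumed simplification): each image attends to itself.
    ir_self = attention(ir_tokens, ir_tokens, ir_tokens)
    vis_self = attention(vis_tokens, vis_tokens, vis_tokens)
    # Step 2: each image queries the other, so matching regions
    # can pull in information from the other camera.
    ir_cross = attention(ir_self, vis_self, vis_self)
    vis_cross = attention(vis_self, ir_self, ir_self)
    return ir_cross, vis_cross

# Toy input: 16 patch tokens with 8 features per camera (hypothetical sizes).
rng = np.random.default_rng(0)
ir = rng.normal(size=(16, 8))
vis = rng.normal(size=(16, 8))
ir_out, vis_out = double_attention(ir, vis)
print(ir_out.shape, vis_out.shape)  # (16, 8) (16, 8)
```

In the cross step the queries come from one camera and the keys and values from the other, which is what lets one view guide the other.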

Figure 2. Stepwise mixing of heat and detail regions so each image leads where it is most informative.

Letting the Data Decide Who Leads

A key idea in this work is that the balance between the two cameras should change from place to place in the image. In some regions, heat-based shapes are more important, like a person standing against a busy background. In other regions, visible texture matters more, such as road markings or building edges. The authors design an adaptive training rule that measures how visually active each camera is in each small patch of the image, and then automatically changes how strongly that patch influences the learning process. This guides the network to highlight whichever source is more informative locally, instead of forcing equal weight everywhere.
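One way to picture such a patch-wise rule is the sketch below. It is an assumption-laden illustration, not the paper's loss: this summary does not say how "visual activity" is measured, so the example uses mean gradient magnitude per patch, and the function names, the patch size of 8, and the squared-error terms are all hypothetical choices.

```python
import numpy as np

def patch_mean(values, patch):
    """Average a 2-D array over non-overlapping patch x patch blocks."""
    h, w = values.shape
    ph, pw = h // patch, w // patch
    cropped = values[:ph * patch, :pw * patch]
    return cropped.reshape(ph, patch, pw, patch).mean(axis=(1, 3))

def patch_activity(image, patch):
    # Assumed activity measure: mean absolute gradient inside each patch.
    gy, gx = np.gradient(image.astype(np.float64))
    return patch_mean(np.abs(gx) + np.abs(gy), patch)

def adaptive_weights(ir, vis, patch=8, eps=1e-8):
    # The more active source in a patch receives the larger weight.
    a_ir = patch_activity(ir, patch)
    a_vis = patch_activity(vis, patch)
    w_ir = (a_ir + eps) / (a_ir + a_vis + 2.0 * eps)
    return w_ir, 1.0 - w_ir

def adaptive_fusion_loss(fused, ir, vis, patch=8):
    # Per-patch squared error to each source, mixed by the local weights.
    w_ir, w_vis = adaptive_weights(ir, vis, patch)
    err_ir = patch_mean((fused - ir) ** 2, patch)
    err_vis = patch_mean((fused - vis) ** 2, patch)
    return float(np.mean(w_ir * err_ir + w_vis * err_vis))

# Toy case: a flat infrared image and a textured visible ramp.
ir = np.zeros((16, 16))
vis = np.tile(np.arange(16, dtype=np.float64), (16, 1))
w_ir, w_vis = adaptive_weights(ir, vis)
print(w_vis.round(3))  # close to 1 in every patch: the visible image leads
```

In this toy case the infrared image has no structure, so a fused result that copies the visible image is penalized far less than one that copies the infrared image.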

How Well the New Method Performs

The team tests their method on two standard collections of outdoor scenes that mix roads, vehicles, people, and complex backgrounds. They compare against seven leading fusion techniques drawn from different deep learning families. Both visual inspection and several numerical scores show that the new approach delivers images with higher contrast, sharper edges, and richer details while still preserving key thermal targets. Further tests, where parts of the model are removed or altered, confirm that both the cross-attention design and the adaptive training rule play crucial roles in the improved results.

What This Means for Real-World Vision

To a lay reader, the takeaway is simple. By teaching a computer not just to look at two cameras but to manage how they influence each other in a careful, location-dependent way, this method produces clearer combined images than earlier approaches. That can make it easier for people and automated systems to spot important objects in tricky conditions, and the same ideas may help future tools that merge other types of sensor data.

Citation: Wang, Z., Hu, Y. & Zhang, B. Infrared-visible image fusion with double-attention mechanism and adaptive interaction loss. Sci Rep 16, 15941 (2026). https://doi.org/10.1038/s41598-026-45802-9

Keywords: image fusion, infrared imaging, computer vision, attention networks, autonomous driving