Clear Sky Science

Soft smooth contrastive learning with hybrid memory for unsupervised visible-infrared person re-identification

2026-03-17

Seeing People in the Dark

Modern cities are blanketed with cameras, but most of them struggle at night or in bad weather. Infrared cameras, which sense heat instead of visible light, can fill that gap. The challenge is teaching computers to recognize the same person when they look very different to a daylight camera and to a heat-sensing camera, and doing it without human experts labeling thousands of example images. This study proposes a new way to learn that matching automatically, making round-the-clock, privacy-aware security systems more practical.

Matching People Across Two Very Different Worlds

Visible-infrared person re-identification asks a simple-sounding question: given a person seen by a regular color camera, can we find the same person in footage from an infrared camera, or vice versa? In reality, the two types of images differ in color, contrast, and detail, so the computer’s internal description of a person can drift apart across camera types. Earlier systems often relied on large sets of hand-labeled images, where humans carefully indicated which pictures showed the same individual. That is expensive and slow, especially for multi-camera networks in large spaces like campuses, airports, or city blocks.

Learning Without Human Labels

The authors focus on the harder “unsupervised” version of the problem, where no ground-truth identity labels are provided. Instead, the computer first groups images that appear similar into clusters, treating each cluster as if it were one person. These guessed identities are called pseudo-labels. They power a popular training strategy known as contrastive learning, where the model pulls images from the same cluster closer together in its internal representation and pushes different clusters apart. But clustering is far from perfect: people wearing similar clothes can be confused, and the gap between visible and infrared views adds further mistakes. Once these wrong guesses get baked into training, they can mislead the model and reduce its reliability.
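The pull-together, push-apart idea can be made concrete with a small sketch. It assumes that each cluster is summarized by the average of its members and that the loss is a standard InfoNCE-style cluster contrast; the function names, the toy two-dimensional features, and the temperature of 0.05 are illustrative choices, not details taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_centroids(features, pseudo_labels):
    """Average the features that share a pseudo-label, then normalize each mean."""
    groups = {}
    for feat, label in zip(features, pseudo_labels):
        groups.setdefault(label, []).append(feat)
    centroids = {}
    for label, members in groups.items():
        mean = [sum(col) / len(members) for col in zip(*members)]
        centroids[label] = normalize(mean)
    return centroids

def contrastive_loss(feature, label, centroids, temperature=0.05):
    """InfoNCE-style loss: small when the feature sits near its own centroid
    and far from every other centroid."""
    q = normalize(feature)
    logits = {lab: dot(q, c) / temperature for lab, c in centroids.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    log_partition = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return log_partition - logits[label]

# Toy 2-D features: two images per guessed identity (made-up numbers).
features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.1]]
pseudo_labels = [0, 0, 1, 1]

centroids = cluster_centroids(features, pseudo_labels)
loss_right = contrastive_loss(features[0], 0, centroids)  # correct pseudo-label
loss_wrong = contrastive_loss(features[0], 1, centroids)  # mistaken pseudo-label
print(loss_right, loss_wrong)
```

The second loss is much larger than the first, which shows why a wrong pseudo-label is harmful: training would push hard to drag the image toward a cluster it does not belong to.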

Smoothing Out Noisy Guesses

To tame these faulty pseudo-labels, the paper introduces a “soft smooth” contrastive learning scheme that uses two cooperating neural networks, a student and a teacher. The student is updated in the usual way during training, while the teacher is a slow-moving average of the student’s parameters. For each image, the teacher produces a gentle probability-style assessment of how well it fits each cluster, rather than a hard yes-or-no decision. This soft assessment is then blended with the student’s harder cluster assignment. The result is a smoothed target that tones down uncertain decisions and increases the influence of more reliable ones. In effect, the model learns to trust gradual trends over time instead of reacting sharply to every noisy update.
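A minimal sketch of the two ingredients described above, under stated assumptions: the teacher is updated as an exponential moving average of the student, and the smoothed target is a simple linear blend of a one-hot cluster assignment with the teacher's softmax probabilities. The blend weight `alpha`, the temperature, and all function names are hypothetical; the paper's exact formulation may differ.

```python
import math

def ema_update(teacher_params, student_params, momentum=0.999):
    """Teacher parameters drift slowly toward the student's (moving average)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def softmax(scores, temperature=0.05):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_target(teacher_sims, pseudo_label, alpha=0.5, temperature=0.05):
    """Blend a hard one-hot assignment with the teacher's soft probabilities.
    alpha is an assumed mixing weight, not a value from the paper."""
    soft = softmax(teacher_sims, temperature)
    return [alpha * (1.0 if k == pseudo_label else 0.0) + (1.0 - alpha) * p
            for k, p in enumerate(soft)]

def soft_cross_entropy(student_sims, target, temperature=0.05):
    """Cross-entropy between the student's prediction and the smoothed target."""
    scaled = [s / temperature for s in student_sims]
    top = max(scaled)
    log_z = top + math.log(sum(math.exp(s - top) for s in scaled))
    return -sum(t * (s - log_z) for t, s in zip(target, scaled))

# Teacher moves only a small step toward the student.
teacher = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.9)

# Teacher similarities of one image to three clusters (made-up numbers).
confident = smoothed_target([0.90, 0.10, 0.0], pseudo_label=0)
ambiguous = smoothed_target([0.52, 0.50, 0.0], pseudo_label=0)
print(teacher, confident, ambiguous)
print(soft_cross_entropy([0.52, 0.50, 0.0], ambiguous))
```

When the teacher is confident, the target stays close to the hard label. When two clusters look almost equally likely, part of the target mass moves to the runner-up cluster, so a possibly wrong pseudo-label is enforced less strictly.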

Remembering Both Differences and Common Ground

The second key idea is a “hybrid memory” that stores what the system has learned so far. Conventional methods maintain separate memories for visible and infrared images, which preserves what is distinctive about each camera type but makes it hard to distill what the two share. Here, the authors keep those two memories and add a third: a blended memory that mixes the most similar visible and infrared examples. This hybrid memory acts as a meeting place, encouraging the network to discover features of a person that are stable across lighting conditions and sensors, such as overall body shape or clothing layout rather than color. The final component of the method, adaptive-weight memory updating, gives more influence to unusual but trustworthy examples and less to ambiguous ones, so the memory evolves toward sharper, more globally useful representations.
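The sketch below illustrates one way a hybrid memory and an adaptive-weight update could work. It assumes that each memory holds one unit-length prototype per cluster, that blending means averaging a visible prototype with its most similar infrared prototype, and that the update weight is the product of a confidence score and a novelty score. The mixing ratio, the weighting rule, and every name here are hypothetical stand-ins, since the summary does not give the paper's formulas.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_hybrid_memory(visible_mem, infrared_mem, mix=0.5):
    """Blend each visible prototype with its most similar infrared prototype."""
    hybrid = []
    for v in visible_mem:
        best = max(infrared_mem, key=lambda r: dot(v, r))
        blended = [mix * a + (1.0 - mix) * b for a, b in zip(v, best)]
        hybrid.append(normalize(blended))
    return hybrid

def adaptive_update(prototype, feature, confidence, base_momentum=0.9):
    """Momentum update whose step size depends on an assumed weight:
    confidence (how trustworthy) times novelty (how unusual)."""
    f = normalize(feature)
    novelty = 1.0 - dot(prototype, f)  # 0 when identical, larger when unusual
    weight = max(0.0, min(1.0, confidence * novelty))
    # weight = 1 gives the full step; weight = 0 leaves the prototype untouched.
    momentum = base_momentum + (1.0 - base_momentum) * (1.0 - weight)
    return normalize([momentum * p + (1.0 - momentum) * x
                      for p, x in zip(prototype, f)])

# Toy 2-D prototypes (made-up numbers).
visible_mem = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
infrared_mem = [normalize([0.1, 1.0]), normalize([1.0, 0.2])]
hybrid_mem = build_hybrid_memory(visible_mem, infrared_mem)

proto = [1.0, 0.0]
trusted = adaptive_update(proto, [0.0, 1.0], confidence=1.0)
doubtful = adaptive_update(proto, [0.0, 1.0], confidence=0.2)
ignored = adaptive_update(proto, [0.0, 1.0], confidence=0.0)
print(hybrid_mem, trusted, doubtful, ignored)
```

In this toy run the same unusual example shifts the prototype noticeably when it is trusted, only slightly when it is doubtful, and not at all when its confidence is zero.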

Putting the Method to the Test

The team evaluates its approach, called Soft Smooth Contrastive Learning with Hybrid Memory (SCLHM), on three widely used datasets that include both visible and infrared footage collected by multiple cameras in realistic settings. They compare their system to many existing methods, including some that use full human labeling and others that work with partial labels or none at all. Across the board, SCLHM achieves state-of-the-art performance among label-free approaches, and in several cases it comes close to, or even rivals, methods that rely on manual annotations. Additional experiments show that each of the three pieces (soft smoothing, hybrid memory, and adaptive updating) contributes meaningfully to the final accuracy.

Clearer Sight Around the Clock

For a general reader, the core message is that the authors have built a way for computers to teach themselves to recognize people across day and night cameras without requiring humans to name who is who. By smoothing out unreliable guesses and carefully combining what is unique to each camera type with what they share, their framework learns more stable and general patterns. This makes person tracking in complex, low-light environments more accurate and scalable, which could benefit security, traffic management, and other applications that depend on reliable, around-the-clock visual sensing.

Citation: Zhang, C., Su, Y., Wang, N. et al. Soft smooth contrastive learning with hybrid memory for unsupervised visible-infrared person re-identification. Sci Rep 16, 13951 (2026). https://doi.org/10.1038/s41598-026-44364-0

Keywords: person re-identification, infrared imaging, unsupervised learning, contrastive learning, surveillance