Clear Sky Science

Real-time underwater object detection via frequency-domain dynamics and spatially enhanced feature modulation

Seeing Clearly Beneath the Waves

The world under the sea is vital to food security, energy, and environmental health, but it is also hard to see. Murky water, drifting particles, and dim light can make even simple tasks like counting starfish or checking pipes on the seafloor surprisingly difficult. This study introduces a new computer vision method that helps underwater robots and cameras spot small sea creatures more accurately and more quickly, even when the view is blurry or clouded.

Figure 1. How a smart lightweight model helps robots see small sea animals clearly in murky underwater scenes in real time.
Why Underwater Vision Is So Hard

Unlike air, water strongly absorbs and scatters light, and the reds and yellows that our eyes rely on for contrast fade first. Images taken underwater often look blue-green, hazy, and low in detail, with bright backscatter from floating particles. Small animals such as scallops or sea urchins may occupy only a few pixels in a picture and can easily blend into rocks, sand, or seaweed. Traditional object detection software, originally designed for sharp images on land, tends to miss these faint targets or mistake background clutter for real animals. At the same time, underwater robots and sensors usually run on limited hardware, so the detection method must be fast and lightweight, not just accurate.

A Faster Way to Read Noisy Images

The authors build on a recent family of models known as Detection Transformers, which scan an image by learning relationships between all its parts instead of sliding a small window over it. Their variant keeps the real-time speed of an earlier system called RT-DETR but swaps in a new backbone, named FasterFDBlock, that is better suited to noisy underwater scenes. This backbone combines a trick called partial convolution, which processes only a fraction of the image channels to save time, with a frequency-based view of the picture. By working in the frequency domain, the model can tell random speckled noise apart from the sharp edges that outline animals, toning down the former while preserving the latter and reducing wasted computation.
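The two ideas in this paragraph can be illustrated with a toy sketch on one-dimensional signals. It is not the authors' FasterFDBlock: the function names, the 25 percent channel ratio, and the fixed magnitude threshold are all assumptions made for illustration, whereas the real backbone works on two-dimensional feature maps and learns its filtering from data.

```python
import cmath

def partial_conv1d(channels, kernel, ratio=0.25):
    """Filter only the first `ratio` of the channels; copy the rest unchanged.

    `ratio` is an assumed value. The kernel is applied as a sliding
    dot product and is assumed to have odd length.
    """
    n_conv = max(1, int(len(channels) * ratio))
    half = len(kernel) // 2
    out = []
    for idx, signal in enumerate(channels):
        if idx >= n_conv:
            out.append(list(signal))  # untouched channels cost no arithmetic
            continue
        padded = [0.0] * half + list(signal) + [0.0] * half  # zero padding
        out.append([
            sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))
        ])
    return out

def dft(signal):
    """Plain discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse transform, keeping the real part."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]

def suppress_weak_frequencies(signal, keep_fraction=0.1):
    """Zero every frequency whose magnitude is below keep_fraction of the peak.

    A fixed threshold is a hypothetical stand-in for a learned filter.
    """
    spectrum = dft(signal)
    peak = max(abs(c) for c in spectrum)
    filtered = [c if abs(c) >= keep_fraction * peak else 0 for c in spectrum]
    return idft(filtered)

# A sharp edge (step) plus a faint, rapidly alternating disturbance.
clean = [0.0] * 8 + [1.0] * 8
noisy = [x + 0.02 * (-1) ** t for t, x in enumerate(clean)]
denoised = suppress_weak_frequencies(noisy)
```

In this example the disturbance sits in a single weak frequency, so thresholding removes it while the stronger components that form the edge survive. Real underwater noise is far less tidy, which is why a learned filter is the more plausible choice in practice.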

Keeping Small Creatures in Focus

Deep vision networks often lose fine detail as they repeatedly shrink an image to extract higher-level patterns. That can be fatal for spotting tiny scallops or starfish that already sit near the edge of visibility. To fight this, the researchers redesign the core attention block in the encoder, creating what they call AIFI-SEFN. In simple terms, one branch of this module looks at the big picture using attention, while a companion branch focuses on local texture and shape. This companion branch pools and enlarges features across scales, uses lightweight convolutions to capture edges and patterns, and then gates how much of this detail is allowed through. The result is a richer blend of global context and crisp local structure, so small animals stand out more clearly against rough seabeds and plants.
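The local branch described above (pool, enlarge, convolve, gate) can be sketched on a one-dimensional feature row. This is a loose illustration under stated assumptions, not the AIFI-SEFN module itself: the helper names, the pooling size of 2, the fixed smoothing kernel, and the sigmoid gate are all hypothetical choices, and the attention branch is omitted.

```python
import math

def avg_pool(signal, size=2):
    """Average neighbouring values to get a coarser summary."""
    return [
        sum(signal[i:i + size]) / len(signal[i:i + size])
        for i in range(0, len(signal), size)
    ]

def upsample(signal, factor, length):
    """Enlarge a coarse summary back to `length` by repeating values."""
    return [signal[min(i // factor, len(signal) - 1)] for i in range(length)]

def smooth(signal, kernel=(0.25, 0.5, 0.25)):
    """Small fixed filter standing in for a learned lightweight convolution."""
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_local_branch(features):
    """Pool and enlarge to build a gate, then let that gate scale local detail."""
    coarse = upsample(avg_pool(features, 2), 2, len(features))
    detail = smooth(features)
    gate = [sigmoid(c) for c in coarse]  # each gate value lies between 0 and 1
    return [g * d for g, d in zip(gate, detail)]

row = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0]
gated = gated_local_branch(row)
```

Where the coarse summary is strong, the gate approaches 1 and local detail passes almost unchanged; where it is weak, the detail is damped. In a full model the gated output would then be merged with the attention branch.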

Figure 2. How frequency filtering and multi-scale feature fusion turn a noisy underwater image into clear highlighted sea creatures.
Blending Information Across Scales

Underwater images rarely contain objects of a single size; the same type of organism might appear as a tiny speck in the distance or a large patch in the foreground. Simple ways of fusing information from shallow and deep layers, such as just adding feature maps together, can bury small details under heavy high-level signals or let shallow noise overwhelm the scene. The new Multi-scale Feature Modulation module tackles this by first summarizing what each layer "sees" through global pooling, then assigning adaptive weights to semantic and detailed features for every channel. These weights always add up to one, so the model must decide, channel by channel, whether detail or broad context matters more. This selective blending strengthens the signals from real targets and dampens distractions from rocks, sand, and shadows, without adding much extra cost.
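A minimal sketch of this channel-by-channel blending follows, assuming the per-channel weights come from a softmax over the two pooled summaries. The function names and the softmax itself are illustrative assumptions; the paper's module presumably uses learned layers to produce its weights.

```python
import math

def global_pool(channel):
    """Summarise a whole channel as its mean value."""
    return sum(channel) / len(channel)

def modulation_weights(semantic_score, detail_score):
    """Two-way softmax, so the pair of weights always sums to one."""
    top = max(semantic_score, detail_score)  # subtract the max for stability
    e_sem = math.exp(semantic_score - top)
    e_det = math.exp(detail_score - top)
    total = e_sem + e_det
    return e_sem / total, e_det / total

def modulate(semantic, detail):
    """Blend deep (semantic) and shallow (detail) features per channel."""
    fused, weights = [], []
    for sem_ch, det_ch in zip(semantic, detail):
        w_sem, w_det = modulation_weights(global_pool(sem_ch), global_pool(det_ch))
        fused.append([w_sem * s + w_det * d for s, d in zip(sem_ch, det_ch)])
        weights.append((w_sem, w_det))
    return fused, weights

semantic = [[1.0, 1.0], [0.0, 0.0]]
detail = [[1.0, 1.0], [2.0, 2.0]]
fused, weights = modulate(semantic, detail)
```

In the first channel both summaries agree, so the weights split evenly; in the second, the detail features carry more signal and receive the larger share. Because each pair sums to one, boosting one source necessarily reduces the other.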

How Well the Method Works

The team tested their approach on a challenging public dataset of underwater images that includes sea cucumbers, sea urchins, scallops, and starfish, many of them small, overlapping, or partially hidden. Compared to the original RT-DETR model, the new system raised the standard detection score (mean Average Precision) from 70.4 to 72.1 percent while cutting the number of parameters by over a quarter and reducing the amount of computation by nearly a quarter. It still runs at over 70 frames per second, fast enough for real-time use on typical graphics hardware. Visual comparisons of heatmaps and detection results show that the improved model locks onto actual animals more tightly, ignores confusing textures in rocks and seaweed, and recovers more tiny or low-contrast targets in murky or low-light scenes.
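For readers who want the reported figures in other terms, the short calculation below uses only the numbers quoted above. The variable names are illustrative, and 70 frames per second is treated as a lower bound because the text says "over 70".

```python
baseline_map = 70.4   # mAP of the original RT-DETR, in percent
improved_map = 72.1   # mAP of the new model, in percent
min_fps = 70          # reported speed is "over 70 frames per second"

absolute_gain = improved_map - baseline_map          # percentage points
relative_gain = absolute_gain / baseline_map * 100   # percent of the baseline
max_frame_time_ms = 1000 / min_fps                   # upper bound per frame

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f} %")
print(f"time per frame: under {max_frame_time_ms:.1f} ms")
```

The gain of 1.7 percentage points is modest on its own; what makes it notable is that it comes with a smaller, cheaper model and not a larger one.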

What This Means for Underwater Work

In everyday terms, this research shows how to teach a lean, fast model to see better in one of the most difficult visual settings on Earth. By carefully shaping how the network handles noisy frequencies, local detail, and features at different scales, the authors make underwater object detection both more accurate and more efficient. That balance is important for autonomous underwater vehicles and other field systems that must make quick, reliable decisions with limited computing power. As these methods are adapted to more datasets and embedded platforms, they could help scientists monitor marine life, engineers inspect underwater structures, and robots navigate complex seafloor terrain with greater confidence.

Citation: Cai, S., Zhu, A. Real-time underwater object detection via frequency-domain dynamics and spatially enhanced feature modulation. Sci Rep 16, 14884 (2026). https://doi.org/10.1038/s41598-026-44628-9

Keywords: underwater object detection, autonomous underwater vehicles, real-time vision, small object recognition, frequency-domain features