Clear Sky Science

Semantic-aware self-supervised learning using progressive sub-action regression for action quality assessment


Seeing Performance Through a New Lens

When we watch Olympic divers or other elite athletes, we instinctively sense who performed better, but turning that intuition into objective numbers is hard. Today’s automated video systems can assign an overall “score” to an action, yet they rarely explain why a dive was good or bad, or which part needs work. This paper presents a new way for computers to watch complex actions in video, break them into understandable pieces, and score each piece separately—offering feedback closer to what a human coach might give.

Figure 1.

Breaking a Complex Move Into Manageable Pieces

Many current action-quality tools treat a full dive or movement as a single block, producing just one overall score. That hides crucial detail: a diver might launch perfectly but enter the water poorly, and a single number cannot reveal this. The authors tackle this by teaching the computer to split each video into meaningful stages, or sub-actions, such as start, takeoff, flight, and entry. Importantly, this splitting is done automatically, without any human marking of where one stage ends and the next begins. An unsupervised clustering method groups neighboring frames that “behave” similarly over time, giving the system a rough but reliable storyboard of the performance.
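To make the idea of grouping neighboring frames concrete, here is a minimal sketch in Python. It is not the authors' clustering method, which this summary does not specify. It assumes each frame has already been reduced to a single number (a stand-in for a real feature vector) and uses dynamic programming to cut the sequence into a fixed number of contiguous stages whose frames are as similar as possible. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def segment_cost(prefix: List[float], prefix_sq: List[float], i: int, j: int) -> float:
    """Sum of squared deviations from the mean for frames i..j-1."""
    n = j - i
    total = prefix[j] - prefix[i]
    total_sq = prefix_sq[j] - prefix_sq[i]
    return total_sq - total * total / n


def split_into_stages(features: Sequence[float], num_stages: int) -> List[int]:
    """Return the start index of each of `num_stages` contiguous segments,
    chosen to minimize the total within-segment variation."""
    n = len(features)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of frames")

    # Prefix sums let us evaluate any segment's cost in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for x in features:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    inf = float("inf")
    # best[k][j]: lowest cost of splitting the first j frames into k segments.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    back = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i

    # Walk the back-pointers from the last frame to recover the boundaries.
    starts = []
    j = n
    for k in range(num_stages, 0, -1):
        j = back[k][j]
        starts.append(j)
    starts.reverse()
    return starts


# Toy per-frame values with three visibly different phases.
toy_features = [0.0, 0.1, 0.0, 5.0, 5.1, 4.9, 9.0, 9.2]
stage_starts = split_into_stages(toy_features, num_stages=3)
```

Because every segment must be contiguous, the result reads like a storyboard: each stage is an unbroken run of frames, and no human has to mark where one stage ends and the next begins.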

Letting the System Teach Itself What Matters

Once the video is divided into stages, the system needs to understand what each stage looks like when it is performed well or poorly. Instead of relying on dense, hand-made labels, the authors use self-supervised learning: the model is shown many versions of the same sub-action where chunks of frames are deliberately removed or “masked.” The system must still produce similar internal descriptions for both the complete and the partially missing clips. By learning to ignore these artificial gaps, it becomes robust to real-world issues such as brief occlusions, missed frames, or slightly inaccurate stage boundaries, and it learns to focus on the essential patterns of motion and posture that define quality.
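The masking idea can be illustrated with a small Python sketch. It only computes the kind of consistency penalty described above and does not train anything. The encoder here is a deliberately trivial average over frames, standing in for the learned network, and the penalty is one minus cosine similarity; both choices are assumptions for illustration, as are all function names.

```python
import math
import random
from typing import List

Frame = List[float]


def mask_chunk(frames: List[Frame], chunk_len: int, rng: random.Random) -> List[Frame]:
    """Return a copy of the clip with one contiguous chunk of frames removed."""
    if not 0 < chunk_len < len(frames):
        raise ValueError("chunk_len must leave at least one frame in the clip")
    start = rng.randrange(0, len(frames) - chunk_len + 1)
    return frames[:start] + frames[start + chunk_len:]


def encode(frames: List[Frame]) -> Frame:
    """Toy encoder: average the per-frame feature vectors (stand-in for a network)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine_similarity(a: Frame, b: Frame) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def consistency_loss(frames: List[Frame], chunk_len: int, rng: random.Random) -> float:
    """Penalty that is near 0 when the full and masked clips get similar descriptions."""
    full = encode(frames)
    masked = encode(mask_chunk(frames, chunk_len, rng))
    return 1.0 - cosine_similarity(full, masked)


rng = random.Random(0)
clip = [[float(t), 1.0] for t in range(6)]  # six frames, two features each
loss = consistency_loss(clip, chunk_len=2, rng=rng)
```

In a real system, a training loop would adjust the encoder to push this penalty down across many clips and many random masks, which is what encourages descriptions that do not depend on any particular handful of frames.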

Figure 2.

From One Overall Score to Many Helpful Sub-Scores

Real datasets usually contain only a single overall score for each dive, not separate ratings for each stage. To overcome this, the authors introduce a progressive “pseudo-subscore” strategy. First, they fuse the overall score with the newly learned features for each sub-action and train small networks to guess a provisional score for each stage. Then, they refine these guesses by allowing information to flow along the sequence: each stage’s features are updated using the scores of earlier stages, capturing how a small mistake at takeoff can ripple into flight and entry. In a second variant, every stage has access to all previous stage scores, modeling long-range cause and effect throughout the action. Finally, a compact regression network combines the refined stage scores into an overall prediction. Unlike the first step, this final network is not given the ground-truth score as an input, so it can be used on new videos that have no judge score.
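The flow of information between stages can be sketched in a few lines of Python. This is a heavily simplified illustration, not the paper's architecture: the learned networks are replaced by a fixed blending weight (`carry`), the update acts on scores directly rather than on stage features, and the final regression network is replaced by a plain weighted sum. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def refine_sequential(provisional: Sequence[float], carry: float = 0.3) -> List[float]:
    """First variant: each stage is blended with the refined score of the stage before it."""
    refined: List[float] = []
    for score in provisional:
        if refined:
            refined.append((1 - carry) * score + carry * refined[-1])
        else:
            refined.append(score)  # the first stage has no predecessor
    return refined


def refine_all_previous(provisional: Sequence[float], carry: float = 0.3) -> List[float]:
    """Second variant: each stage is blended with the mean of all earlier refined scores."""
    refined: List[float] = []
    for score in provisional:
        if refined:
            context = sum(refined) / len(refined)
            refined.append((1 - carry) * score + carry * context)
        else:
            refined.append(score)
    return refined


def overall_score(stage_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Stand-in for the final regressor: a weighted sum of the stage scores."""
    if len(stage_scores) != len(weights):
        raise ValueError("need exactly one weight per stage")
    return sum(w * s for w, s in zip(weights, stage_scores))


# Provisional scores for three stages, e.g. takeoff, flight, entry.
provisional = [8.0, 6.0, 4.0]
sequential = refine_sequential(provisional, carry=0.5)
long_range = refine_all_previous(provisional, carry=0.5)
total = overall_score(sequential, weights=[0.2, 0.3, 0.5])
```

The only difference between the two variants is how far back each stage looks: one step in the first, the whole history in the second. The final combination uses stage scores alone, with no overall judge score among its inputs.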

Testing on Real Diving Competitions

The researchers evaluated their framework on two demanding diving datasets recorded from major international competitions. These collections provide overall scores from human judges, and in some cases rough stage timing, but no stage-level quality labels. The new method achieved state-of-the-art rank correlation, meaning its ordering of athletes closely matches that of expert judges, while also reducing numerical errors in the predicted scores. Careful “ablation” tests showed that both main ideas—self-supervised feature refinement and progressive pseudo-subscore modeling—contribute substantial improvements. Notably, using automatic stage boundaries performed almost as well as using painstaking human annotations, indicating that the system is resilient to imperfect segmentation.
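For readers curious how "rank correlation" is measured, the sketch below computes Spearman's coefficient in plain Python. The summary does not say which rank correlation the authors report, so Spearman's is an assumption here, and the scores in the example are invented for illustration rather than taken from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: the Pearson correlation of the two rank lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example: model predictions versus judge scores for four dives.
predicted = [82.5, 64.0, 91.2, 70.1]
judged = [80.0, 66.5, 95.0, 61.0]
rho = spearman(predicted, judged)
```

A value of 1 means the model orders the dives exactly as the judges do, and a value of -1 means the order is fully reversed. Because only the ordering matters, this measure is usually reported alongside a numerical error on the scores themselves, as the article notes.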

Turning Numbers Into Insightful Coaching Tips

Beyond accuracy, this approach makes automated scoring more interpretable. By assigning a separate score to each stage of a dive, the system can highlight, for example, that two divers share similar takeoffs and flight phases but differ sharply at entry, where one creates a large splash. The analysis of many samples confirms that these stage scores follow the same priorities as human judges, with the entry phase often carrying the most weight. In practical terms, the method can point athletes and coaches to the exact part of a performance that needs improvement, while still working from relatively simple training data. Though demonstrated on diving, the concept is flexible enough to extend to other multi-step tasks—from surgical procedures to rehabilitation exercises—where understanding how each segment contributes to overall quality is key.

Citation: Mazruei, M., Fazl-Ersi, E., Vahedian, A. et al. Semantic-aware self-supervised learning using progressive sub-action regression for action quality assessment. Sci Rep 16, 6670 (2026). https://doi.org/10.1038/s41598-026-36668-y

Keywords: action quality assessment, sports video analysis, self-supervised learning, human motion scoring, deep learning for coaching