Clear Sky Science
Athlete action quality assessment based on transfer neural network quality score decoupling in complex sports scenarios
Why smarter sports judging matters
From Olympic diving to breakdance battles, many sports depend on human judges to turn intricate movements into a single score. But long performances are uneven: some moments are spectacular, others are shaky or simply filler. This study explores how artificial intelligence can watch entire videos of complex performances, pick out the truly important moments, and produce more consistent, fine-grained scores that can support judges, coaches, doctors, and everyday learners.
Watching the whole show, not just the highlight reel
Traditional computer systems that rate athletic performance often treat a full video as if every second matters equally. That assumption breaks down in real events. In breakdancing, for example, early steps that match the music matter less than difficult floor moves, freezes, or power spins later on. Existing methods frequently smooth everything together, which hides both brilliant moves and critical mistakes. The authors frame this as a general problem in long skill videos: quality is uneven over time, and positive and negative evidence can coexist within the same performance. Their goal is to build a system that separates the key moments from background motion, making it easier to compare how well two people actually performed.

Two ways of looking at the same performance
The proposed model looks at each video through two separate lenses. One “dynamic” stream focuses on movement over time using short clips, capturing rhythm, flow, and continuity. The other “static” stream examines individual frames, picking up on posture, body control, and small form errors that may appear only for an instant. Crucially, these streams do not get mixed early. Each first learns its own view of the performance, which helps prevent brief posture mistakes from being drowned out by long smooth sequences, or vice versa. Only after each stream has formed its own quality-aware features are they combined to estimate an overall score.
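The "keep the streams separate, combine late" idea can be sketched in a few lines of Python. This is an illustrative toy, not the authors' network: the function names, the hand-made features (frame-to-frame change for the dynamic stream, a per-frame mean for the static stream), the mean pooling, and the equal fusion weights are all assumptions chosen to keep the example small.

```python
from statistics import fmean

def dynamic_features(frames, clip_len=4):
    """Hypothetical dynamic stream: one motion value per short clip.

    Each frame is a list of numbers (for example joint coordinates).
    Motion is the mean absolute change between consecutive frames in a clip.
    Trailing frames that do not fill a whole clip are ignored.
    """
    feats = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]
        diffs = [abs(b - a)
                 for prev, cur in zip(clip, clip[1:])
                 for a, b in zip(prev, cur)]
        feats.append(fmean(diffs))
    return feats

def static_features(frames):
    """Hypothetical static stream: one posture value per individual frame."""
    return [fmean(frame) for frame in frames]

def late_fusion_score(frames, w_dynamic=0.5, w_static=0.5):
    """Pool each stream on its own, then combine only at the very end.

    Assumes the video has at least `clip_len` frames; the fusion weights
    are placeholders, not values from the study.
    """
    dyn = fmean(dynamic_features(frames))
    sta = fmean(static_features(frames))
    return w_dynamic * dyn + w_static * sta

if __name__ == "__main__":
    video = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    print(late_fusion_score(video))
```

In a real system each stream would be a learned network and the fusion step would itself be trained; the point here is only that neither stream sees the other's features until both have been summarized.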
Separating strong moves from weak ones
At the heart of the system is a “score decoupling” module that explicitly separates video segments that look like strong evidence of skill from those that suggest weaker or flawed execution. Inspired by modern attention-based networks, the model learns two internal “prototypes”: one that seeks out high-quality moments and another that focuses on low-quality ones. As the video is processed, each prototype assigns different weights to different segments, producing two complementary summaries: one built from the best-looking clips, and one from the worst or least helpful clips. A simple average over time is also kept as a neutral baseline. Special training rules push the high- and low-quality views to disagree in useful ways and to focus on different parts of the video, rather than collapsing onto the same few obvious frames.
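A minimal sketch of prototype-based pooling may make this more concrete. It assumes each video segment has already been turned into a small feature vector, and that the two prototypes are plain vectors compared to segments by dot product followed by a softmax. Those choices, the function names, and the `overlap` measure used to check that the two branches attend to different segments are assumptions for illustration, not details confirmed by the study.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(segments, prototype):
    """Weight each segment by its similarity to a prototype, then sum."""
    weights = softmax([dot(seg, prototype) for seg in segments])
    dim = len(segments[0])
    pooled = [sum(w * seg[d] for w, seg in zip(weights, segments))
              for d in range(dim)]
    return pooled, weights

def decouple(segments, high_proto, low_proto):
    """Return high-quality, low-quality, and plain-average summaries."""
    high, w_high = attention_pool(segments, high_proto)
    low, w_low = attention_pool(segments, low_proto)
    dim = len(segments[0])
    avg = [sum(seg[d] for seg in segments) / len(segments)
           for d in range(dim)]
    # Assumed diversity measure: small when the two branches
    # put their attention on different segments.
    overlap = dot(w_high, w_low)
    return {"high": high, "low": low, "avg": avg,
            "w_high": w_high, "w_low": w_low, "overlap": overlap}

if __name__ == "__main__":
    segs = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
    out = decouple(segs, high_proto=[3.0, 0.0], low_proto=[0.0, 3.0])
    print(out["w_high"], out["w_low"], out["overlap"])
```

In training, a term like `overlap` could be added to the loss so that the two prototypes are discouraged from settling on the same few segments, which is one plausible reading of the "special training rules" described above.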

Learning to rank performances by watching pairs
Instead of relying on precise numeric scores from human experts, the system is trained mainly on pairwise comparisons: given two videos, which performer showed better skill overall? For each video in a pair, the model predicts three scores, one each from its high-quality, low-quality, and average branches. It is penalized if it gets the ordering of the pair wrong, or if the separated branches fail to be more discriminating than the simple average. Additional training terms encourage the “good” and “bad” views to emphasize different time segments. Once training is complete, the system can watch a single new video and output one stable quality score, without needing to see a reference video alongside it.
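The pairwise training signal can be illustrated with a small loss function. This sketch assumes a standard margin (hinge) ranking loss for the ordering term and a simple penalty whenever a decoupled branch separates the pair less than the average branch does. The exact loss terms, the margin value, and the names `hinge_rank` and `pairwise_loss` are assumptions, not the published formulation.

```python
def hinge_rank(better, worse, margin=1.0):
    """Zero when `better` outscores `worse` by at least `margin`."""
    return max(0.0, margin - (better - worse))

def pairwise_loss(better, worse, margin=1.0):
    """Toy loss for one training pair.

    `better` and `worse` are dicts with 'high', 'low', and 'avg' scores
    for the video judged better and the video judged worse.
    """
    # Ordering term: every branch should rank the better video higher.
    rank = sum(hinge_rank(better[k], worse[k], margin)
               for k in ("high", "low", "avg"))
    # Disparity term (assumed form): the decoupled branches should
    # separate the pair at least as much as the plain average does.
    avg_gap = better["avg"] - worse["avg"]
    disparity = sum(max(0.0, avg_gap - (better[k] - worse[k]))
                    for k in ("high", "low"))
    return rank + disparity

if __name__ == "__main__":
    a = {"high": 3.0, "low": 2.0, "avg": 1.5}
    b = {"high": 1.0, "low": 0.5, "avg": 0.5}
    print(pairwise_loss(a, b))  # pair ordered correctly
    print(pairwise_loss(b, a))  # pair ordered the wrong way round
```

Because the loss only compares two scores at a time, annotators need to say which of two performances was better rather than agree on an absolute number, and at test time the trained scorer can still be applied to one video on its own.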
From breakdancing battles to surgery and everyday skills
To test their approach, the authors built a new dataset of world-class breakdancing battles and also evaluated the method on two existing collections of long skill videos: everyday tasks such as drawing, cooking, and tying a tie, and surgical and fine-motor activities. Across these diverse settings, their model typically matched or exceeded the accuracy of leading methods at deciding which of two videos shows higher skill. Visualizations of its internal attention maps show that high-quality branches tend to light up around well-controlled, technically demanding moves, while low-quality branches emphasize awkward transitions or incomplete actions. For a lay reader, the bottom line is that this system teaches computers not just to recognize what action is happening, but how well it is done, by carefully separating the best and worst parts of a performance before combining them into a final, interpretable score.
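The evaluation described above, deciding which of two videos shows higher skill, amounts to a pairwise accuracy. A minimal sketch follows, assuming each test pair is stored as (better video, worse video) and that `score_fn` is any function returning a quality score for a single video. Both the data layout and the names are hypothetical, and the scores in the usage example are made up purely to exercise the function.

```python
def pairwise_accuracy(pairs, score_fn):
    """Fraction of pairs where the better video gets the higher score.

    `pairs` is an iterable of (better_video, worse_video) tuples.
    Ties count as errors in this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair to compute accuracy")
    correct = sum(1 for better, worse in pairs
                  if score_fn(better) > score_fn(worse))
    return correct / len(pairs)

if __name__ == "__main__":
    # Made-up scores for three hypothetical videos.
    scores = {"a": 0.9, "b": 0.4, "c": 0.7}
    test_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
    print(pairwise_accuracy(test_pairs, scores.get))
```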
Citation: Gao, L., Ma, Y., Bi, S. et al. Athlete action quality assessment based on transfer neural network quality score decoupling in complex sports scenarios. Sci Rep 16, 15795 (2026). https://doi.org/10.1038/s41598-026-43987-7
Keywords: action quality assessment, sports video analysis, breakdancing, attention-based models, skill evaluation