Clear Sky Science · en

A spectral test of the butterfly effect and physical consistency in the diffusion-based GenCast’s ensembles

Why tiny weather errors matter

Weather forecasts do not fail just because we lack data; they also fail because small mistakes in our picture of today’s atmosphere can grow over time, an idea often called the butterfly effect. As powerful artificial intelligence models begin to rival traditional physics-based weather models, it becomes crucial to ask not only whether these systems score well on common accuracy charts but also whether they behave like the real atmosphere when tiny errors grow and spread.

Figure 1. Comparing traditional physics-based weather ensembles with AI-generated ensembles to see how forecast errors grow and spread.

Old tools and new tools for forecasting

Modern weather agencies rely on physics-based computer models that simulate winds, clouds, and temperature around the globe. These models, such as the forecast system of the European Centre for Medium-Range Weather Forecasts (ECMWF), run many slightly different versions at once to show how uncertainty grows in time. Recently, deep learning weather prediction models like GenCast have appeared. Instead of solving physical equations step by step, GenCast learns from decades of reanalysis data and adds carefully structured noise to create large ensembles of forecasts very quickly. This study asks whether these AI-made ensembles behave like those from traditional models when it comes to the spread of errors across different spatial scales.

Following energy across big and small scales

The authors focus on the kinetic energy of winds high in the atmosphere near jet streams, where small disturbances can grow rapidly. They look at how differences between ensemble members evolve with time and how energy is distributed across scales, from planetary waves spanning thousands of kilometers down to mesoscale features a few hundred kilometers wide. In the traditional ensemble model, small-scale errors either grow and then feed larger patterns or are damped by physical processes such as turbulence and convection. This behavior leaves a characteristic curve when energy is plotted as a function of scale, with predictable slopes at different size ranges.
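To make the idea of "energy as a function of scale" concrete, here is a minimal sketch of a difference kinetic energy spectrum. It is not the authors' method: it assumes winds sampled at evenly spaced points around a single latitude circle and uses a plain one-dimensional Fourier transform, whereas a global study would more likely work with the full sphere. The function names `dft` and `difference_ke_spectrum` are hypothetical.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform, normalized by the number of points."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n)) / n
        for k in range(n)
    ]


def difference_ke_spectrum(u_a, v_a, u_b, v_b):
    """Kinetic energy of the wind difference between two ensemble members,
    split by zonal wavenumber (assumption: one evenly sampled latitude circle)."""
    n = len(u_a)
    du = [a - b for a, b in zip(u_a, u_b)]  # eastward wind difference
    dv = [a - b for a, b in zip(v_a, v_b)]  # northward wind difference
    du_hat, dv_hat = dft(du), dft(dv)

    spectrum = []
    for k in range(n // 2 + 1):
        energy = 0.5 * (abs(du_hat[k]) ** 2 + abs(dv_hat[k]) ** 2)
        # For real input, every wavenumber except 0 and Nyquist also appears
        # at n - k, so its energy is counted twice.
        if 0 < k < n / 2:
            energy *= 2.0
        spectrum.append(energy)
    return spectrum


# Synthetic illustration (made-up numbers, not data from the study):
# two members that differ only by a wave that fits three times around the circle.
n_points = 64
member_a_u = [10.0 + 2.0 * math.cos(2 * math.pi * 3 * j / n_points) for j in range(n_points)]
member_b_u = [10.0] * n_points
calm_v = [0.0] * n_points
spectrum = difference_ke_spectrum(member_a_u, calm_v, member_b_u, calm_v)
```

In this synthetic case all of the difference energy lands in wavenumber 3, and the spectrum sums to the mean kinetic energy of the difference field. Repeating the calculation at successive forecast times would show which scales gain error energy first.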

What the AI model gets right and wrong

GenCast captures some aspects of this behavior. At medium scales associated with typical weather systems, its error growth looks similar to that of the traditional ensemble, and its large-scale flow patterns resemble those from the reference models and from reanalysis. However, two major problems emerge. First, at the very largest scales, GenCast does not show the strong growth of error seen in the physics-based ensemble, suggesting that its representation of the butterfly effect is incomplete. Second, at smaller mesoscale ranges, the energy curve flattens into a nearly constant tail from the first forecast step and barely changes with time. When the wind is decomposed into components and examined on maps, these small-scale features look broad, noisy, and almost like random grain instead of sharp filaments aligned with jet streams or shaped by mountains.
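One simple way to tell a sloping energy curve from a flat tail is to fit a straight line to the spectrum on log-log axes over a chosen band of wavenumbers. The sketch below is an illustration under stated assumptions, not the diagnostic used in the study: the helper name `loglog_slope`, the wavenumber band, and the synthetic spectra are all hypothetical.

```python
import math


def loglog_slope(wavenumbers, energies):
    """Least-squares slope of log(energy) against log(wavenumber).

    A strongly negative slope means energy falls off quickly toward small
    scales; a slope near zero means the curve is flat, like white noise.
    """
    xs = [math.log(k) for k in wavenumbers]
    ys = [math.log(e) for e in energies]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


# Synthetic spectra for illustration only (not values from the study).
band = list(range(10, 41))                    # hypothetical wavenumber band
steep_spectrum = [k ** -3.0 for k in band]    # energy drops quickly with wavenumber
flat_spectrum = [2.0 for _ in band]           # constant tail

steep_slope = loglog_slope(band, steep_spectrum)
flat_slope = loglog_slope(band, flat_spectrum)
```

Here the steep synthetic spectrum gives a slope of about -3 and the constant one gives a slope of zero. Tracking such a slope at each forecast step would reveal a tail that stays flat from the first step onward.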

Figure 2. How injected noise in an AI weather model creates static small-scale patterns instead of realistic evolving jets and flows.

Clues from maps of sharpness and terrain

By separating large and small scales and examining how they interact, the authors find that the physics-based models show clear signatures of topography, such as alternating patterns over the Andes that indicate mountain waves and drag on the flow. GenCast lacks these clear signals and instead shows very broad, intense small-scale energy that varies wildly from one ensemble member to another. When they look at how quickly kinetic energy changes in space, the traditional models display thin, meandering lines that trace jet cores, while GenCast produces thick, isotropic belts of noisy gradients that barely evolve over ten days. This suggests that the AI model’s noise injection produces variance that looks energetic in a statistical sense but does not follow the pathways of real atmospheric dynamics.
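"How quickly kinetic energy changes in space" can be read as the magnitude of the kinetic energy gradient. The following is a rough sketch of that quantity, assuming a flat, evenly spaced grid with spacings `dx` and `dy`; a real global analysis would need to account for the sphere, and the study's exact sharpness measure may differ. The function names are hypothetical.

```python
import math


def kinetic_energy(u, v):
    """Kinetic energy per unit mass, 0.5 * (u^2 + v^2), at every grid point."""
    return [
        [0.5 * (uu * uu + vv * vv) for uu, vv in zip(u_row, v_row)]
        for u_row, v_row in zip(u, v)
    ]


def gradient_magnitude(field, dx, dy):
    """Magnitude of the horizontal gradient at interior grid points,
    using centered differences (assumption: uniform flat grid)."""
    rows, cols = len(field), len(field[0])
    result = [[0.0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            d_dx = (field[i][j + 1] - field[i][j - 1]) / (2.0 * dx)
            d_dy = (field[i + 1][j] - field[i - 1][j]) / (2.0 * dy)
            result[i - 1][j - 1] = math.hypot(d_dx, d_dy)
    return result


# Synthetic illustration: wind speed increasing steadily from west to east.
u_wind = [[float(j) for j in range(5)] for _ in range(4)]
v_wind = [[0.0] * 5 for _ in range(4)]
sharpness = gradient_magnitude(kinetic_energy(u_wind, v_wind), dx=1.0, dy=1.0)
```

On a map, a narrow jet core would appear as thin ridges of large values in such a field, while grain-like noise would produce broad patches with no preferred direction.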

Why this matters for future AI weather models

The study concludes that GenCast can produce skillful forecasts and realistic large-scale patterns but fails important tests of physical consistency at smaller scales. Its treatment of noise leads to persistent, nonphysical features that do not diffuse or cascade energy as true atmospheric flows do, and that weaken the realistic spread of uncertainty on planetary scales. For AI weather prediction to become a fully trustworthy partner to traditional models, its developers will need to rethink how random disturbances are added and controlled so that ensembles not only match scores but also respect the physics behind the butterfly effect.

Citation: Kim, H., Ryu, J., Son, SW. et al. A spectral test of the butterfly effect and physical consistency in the diffusion-based GenCast’s ensembles. npj Clim Atmos Sci 9, 110 (2026). https://doi.org/10.1038/s41612-026-01380-1

Keywords: AI weather forecasting, ensemble prediction, butterfly effect, kinetic energy spectra, GenCast model