Clear Sky Science

Addressing the data imbalance issue in machine learning modeling of rare and disruptive outage events

Why better storm forecasts matter to you

When a major storm knocks out power, we feel it in very personal ways: no lights, no heat, spoiled food, and lost communications. Utilities try to predict these outages in advance so they can stage repair crews and keep people safe. But the very worst storms are rare, which means there is surprisingly little data about them. This paper shows how a new kind of artificial intelligence can "imagine" realistic rare storms, filling in the gaps in our records and making outage forecasts more accurate when it matters most.

Figure 1.

The challenge of learning from rare disasters

Most power outages are caused by weather, especially hurricanes, Nor’easters, snow and ice storms, and severe thunderstorms. These events are becoming more intense as the climate warms, putting extra stress on aging power grids. Yet the most damaging storms are, by definition, uncommon. Traditional statistical tools and machine-learning models tend to learn best from the many mild and moderate storms, and they struggle with the handful of truly extreme cases. This imbalance in the data leads to underestimates of damage just when utilities most need reliable guidance.

Teaching computers to create new storms

To overcome this imbalance, the authors build a system that generates synthetic storms: computer-created events that look and behave like real storms but are not copies of any single past event. They focus on Connecticut, representing each storm as a grid of 815 cells with 19 types of information per cell, including wind, rain, pressure, turbulence, vegetation, and power-line layout. First, they group 294 historical storms into 12 clusters based on how many "trouble spots" (damage locations that crews must repair) occurred and where they occurred. The rare, high-impact storms end up in four small clusters that need boosting.
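To make the data layout concrete, here is a minimal sketch of how 294 storms on an 815-cell grid with 19 features might be stored and grouped into 12 clusters. The summary does not say which clustering algorithm the authors used, so the plain k-means below is an assumption, as are the placeholder data, the hypothetical `cell_xy` coordinates, and the choice of clustering features (total trouble spots plus their count-weighted centroid).

```python
import numpy as np

N_STORMS, N_CELLS, N_FEATURES, N_CLUSTERS = 294, 815, 19, 12
rng = np.random.default_rng(0)

# One 815 x 19 map per storm. Zeros are a placeholder, not real weather data.
storms = np.zeros((N_STORMS, N_CELLS, N_FEATURES), dtype=np.float32)

# Placeholder trouble-spot counts per storm and cell (random, for illustration only).
severity = rng.gamma(shape=0.5, scale=2.0, size=N_STORMS)
trouble_spots = rng.poisson(lam=severity[:, None], size=(N_STORMS, N_CELLS))

# Hypothetical cell coordinates, used to summarize *where* damage occurred.
cell_xy = rng.uniform(size=(N_CELLS, 2))

# Assumed clustering features: how many trouble spots, and their weighted centroid.
totals = trouble_spots.sum(axis=1)
centroid = (trouble_spots @ cell_xy) / np.maximum(totals, 1)[:, None]
X = np.column_stack([np.log1p(totals), centroid])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize each column


def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means (an assumed stand-in for the authors' clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every storm to every cluster center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return labels


labels = kmeans(X, N_CLUSTERS)
sizes = np.bincount(labels, minlength=N_CLUSTERS)

# In this sketch, the four smallest clusters stand in for the ones needing synthetic boosting.
rare_clusters = np.argsort(sizes)[:4]
print("cluster sizes:", sizes)
print("clusters to boost:", rare_clusters)
```

In the study itself, the clusters that need boosting are the ones holding the rare, high-impact storms; picking the four smallest clusters here is only a convenient proxy for random data.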

How the new AI model builds realistic extremes

The core of the framework combines two modern AI tools. A variational autoencoder compresses each multi-layer storm map into a lower-dimensional "latent" representation that still preserves important patterns, such as stronger winds near the coast. On this compressed space, a diffusion model learns to start from random noise and gradually refine it into a realistic storm, conditioned on which outage-severity cluster is requested. The system then screens the generated storms using a set of metrics that compare their statistics to real events—checking not only individual features like wind speed but also how features move together, as captured by correlation patterns. Only synthetic storms that closely match the physical and statistical behavior of real storms in a given cluster are kept.
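The screening idea can be illustrated with a small sketch. It compares a synthetic sample with real data in two ways: feature by feature (means) and jointly (correlation matrices). The function names, the two gap measures, and the tolerance values are all assumptions made for illustration; the paper's actual metrics and thresholds are not given in this summary.

```python
import numpy as np


def screening_scores(real, synth):
    """Return (largest standardized mean gap, largest correlation gap).

    real, synth: arrays of shape (n_samples, n_features).
    """
    scale = real.std(axis=0) + 1e-12
    mean_gap = np.abs(synth.mean(axis=0) - real.mean(axis=0)) / scale
    corr_gap = np.abs(
        np.corrcoef(synth, rowvar=False) - np.corrcoef(real, rowvar=False)
    )
    return mean_gap.max(), corr_gap.max()


def keep(real, synth, mean_tol=0.25, corr_tol=0.25):
    """Accept a synthetic sample only if both gaps fall within the (assumed) tolerances."""
    mean_gap, corr_gap = screening_scores(real, synth)
    return bool(mean_gap <= mean_tol and corr_gap <= corr_tol)


rng = np.random.default_rng(0)

# Toy "real" data: three strongly correlated features (e.g. wind, gusts, turbulence).
cov = np.array([[1.0, 0.8, 0.8],
                [0.8, 1.0, 0.8],
                [0.8, 0.8, 1.0]])
real = rng.multivariate_normal(np.zeros(3), cov, size=5000)

# A plausible synthetic sample: drawn from the same distribution.
good = rng.multivariate_normal(np.zeros(3), cov, size=5000)

# An implausible one: each feature looks right on its own, but the columns
# are shuffled independently, so the features no longer move together.
bad = np.column_stack([rng.permutation(good[:, j]) for j in range(3)])

print("keep good:", keep(real, good))
print("keep bad: ", keep(real, bad))
```

The shuffled sample passes the per-feature check and fails only on correlations, which is why a screen that looks at how features move together catches problems that single-feature statistics miss.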

Figure 2.

Putting synthetic storms to the test

The authors then ask the crucial question: do these synthetic storms actually help predict outages? They train an existing outage prediction model twice: first on real storms only, and then on the same data enriched with carefully screened synthetic events for the rare, high-impact clusters. They evaluate performance using a strict leave-one-storm-out test, in which each storm is held out in turn and predicted by a model trained on all the others, mimicking the forecasting of new, unseen events. With synthetic enrichment, the model's structural error drops sharply and overall fit improves. For the rare, most disruptive storms, centered root mean squared error falls by about 45%, and summary measures of skill such as the Nash–Sutcliffe efficiency rise from worse-than-baseline levels (worse than simply predicting the average) to clearly useful performance. A comparison with a "random" augmentation, which adds synthetic storms without quality screening, shows much smaller or even negative gains, underscoring the importance of rigorous filtering.
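For readers who want to see what these scores measure, the sketch below implements the standard definitions of the Nash–Sutcliffe efficiency and centered root mean squared error, together with a generic leave-one-storm-out loop. The least-squares model and the toy data are stand-ins chosen for brevity; they are assumptions, not the outage prediction model or the data used in the paper.

```python
import numpy as np


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches predicting the mean,
    negative is worse than predicting the mean."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)


def crmse(obs, pred):
    """Centered RMSE: RMSE after removing each series' own mean (ignores constant bias)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))


def leave_one_storm_out(X, y, storm_id, fit, predict):
    """Predict every storm with a model trained on all the other storms."""
    pred = np.empty(len(y), dtype=float)
    for s in np.unique(storm_id):
        held_out = storm_id == s
        model = fit(X[~held_out], y[~held_out])
        pred[held_out] = predict(model, X[held_out])
    return pred


# Stand-in model (assumption): ordinary least squares with an intercept.
def fit(X, y):
    return np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]


def predict(weights, X):
    return np.c_[np.ones(len(X)), X] @ weights


# Toy data: 20 storms with 30 grid cells each and 3 weather features per cell.
rng = np.random.default_rng(0)
storm_id = np.repeat(np.arange(20), 30)
X = rng.normal(size=(len(storm_id), 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=len(storm_id))

pred = leave_one_storm_out(X, y, storm_id, fit, predict)
print("NSE:  ", nse(y, pred))
print("CRMSE:", crmse(y, pred))
```

Because no storm ever appears in both the training and test sets, the scores reflect how the model handles events it has never seen, which is the situation a utility faces before a new storm arrives.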

What this means for future storms

In plain terms, this study shows that letting AI invent physically consistent extreme storms—and being selective about which invented storms you trust—can make outage forecasts more reliable for the very events that cause the greatest harm. By enriching sparse data on rare but devastating weather, the approach helps utilities better anticipate how many damage locations they will face and where. Although demonstrated for one state and one type of hazard, the same strategy could be extended to wildfires, floods, and other natural threats, offering a new way to strengthen infrastructure planning in a world of growing climate extremes.

Citation: Azizi, M., Zhang, X., Yasenpoor, T. et al. Addressing the data imbalance issue in machine learning modeling of rare and disruptive outage events. Sci Rep 16, 8876 (2026). https://doi.org/10.1038/s41598-026-41838-z

Keywords: synthetic storm data, power outage prediction, diffusion models, extreme weather, data imbalance