As artificial intelligence systems move from chatbots and coding helpers into science labs, classrooms and workplaces, it becomes crucial to know what they can and cannot do. Today’s AI report cards are mostly single test scores on narrow benchmarks, which say little about why a system succeeds or fails—or how it will behave on a new kind of problem. This paper proposes a new way to measure AI that aims to be as systematic and durable as temperature scales are for weather, giving us clearer insight into AI strengths, weaknesses and future performance.
From scattered tests to shared scales
Most current AI evaluations resemble school exams designed one at a time: each benchmark mixes together many skills and difficulties, and the final grade is a single percentage. That percentage depends as much on the quirks of the test as on the abilities of the AI. The authors argue that this makes it impossible to predict performance on new tasks and leads to confusion: one math benchmark may say a model "reasons well" while another suggests the opposite. Instead of only averaging scores, they propose to describe each task in terms of how much it demands along a set of general, human-understandable scales.
Building a common ruler for AI abilities
To create this common ruler, the team designed 18 demand scales that cover broad mental skills and knowledge areas. These include abilities like understanding language, following chains of reasoning, reflecting on one’s own knowledge, and knowing facts from natural, social, applied and formal sciences. They also track “extraneous” demands that can make problems harder or easier without changing the underlying skill, such as how unusual a question is, how much information it piles on, or whether it is multiple-choice. Each scale runs from zero demand to increasingly challenging levels, roughly aligned so that moving up a level means that far fewer people—or AIs—should be able to solve the item.
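As a concrete illustration of this idea, the demand levels for a single question can be held in a small data structure, one integer level per scale. The scale names below are hypothetical stand-ins chosen for this sketch, not the paper's actual 18 labels.

```python
from dataclasses import dataclass

# Hypothetical scale names for illustration only; the paper defines 18
# scales covering cognitive abilities, knowledge areas, and extraneous
# demands (unusualness, information load, answer format, and so on).
SCALES = [
    "language_understanding", "reasoning_chains", "metacognition",
    "knowledge_natural_sci", "knowledge_formal_sci",
    "unusualness", "information_load", "answer_format",
]

@dataclass
class DemandProfile:
    """Demand level for one task item: 0 = no demand, higher = harder."""
    levels: dict  # scale name -> integer demand level

    def max_demand(self) -> int:
        return max(self.levels.values())

# A question that mostly exercises chained reasoning at level 3:
item = DemandProfile(levels={s: 0 for s in SCALES})
item.levels["reasoning_chains"] = 3
print(item.max_demand())  # 3
```

Averaging such profiles over a whole benchmark gives the "demand profile" visualizations the authors use to check what each test really measures.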
Teaching machines to label what tasks really ask for (Figure 1)
Manually scoring thousands of questions along 18 scales would be impossible for expert panels alone, so the authors use advanced language models themselves as annotators. They write detailed rubrics with examples for every level of every scale, then ask a model (GPT‑4o) to assign demand levels to over 16,000 questions drawn from 20 modern AI benchmarks. Human experts check a subset and reach strong agreement with the model’s labels. Once annotated, each benchmark can be visualized as a “demand profile” showing how much it really exercises each ability. This reveals that many celebrated tests do not measure what their designers intended: some claim to focus on reasoning but actually hinge on obscure factual knowledge, others cluster at a single difficulty level, and almost none are both sensitive (covering a good spread of levels) and specific (avoiding unintended skills).
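The annotation step can be sketched as a prompt built from a per-scale rubric, sent to a chat model that replies with a single level. Everything here is an assumed placeholder: `query_llm` stands in for whatever model API is used (the paper used GPT-4o), and the rubric text is invented for illustration.

```python
# Rubric-based annotation sketch; `query_llm` is a placeholder for any
# chat-model call that returns the model's text reply.
RUBRIC_TEMPLATE = """You are rating the demands of a benchmark question.
Scale: {scale}
Level descriptions:
{level_descriptions}
Question: {question}
Reply with a single integer demand level."""

def annotate(question: str, scale: str, level_descriptions: str,
             query_llm) -> int:
    """Ask the model for the demand level of one question on one scale."""
    prompt = RUBRIC_TEMPLATE.format(scale=scale,
                                    level_descriptions=level_descriptions,
                                    question=question)
    return int(query_llm(prompt).strip())

# Usage with a stubbed model that always answers "2":
level = annotate("What is 17 * 24?", "numerical_reasoning",
                 "0: none / 2: multi-step arithmetic / 5: research-level math",
                 query_llm=lambda prompt: "2")
print(level)  # 2
```

Running this over every question and every scale yields the 18-dimensional labels that human experts then spot-check.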
Reading AI ability curves instead of raw scores
With the same scales used on tasks, the next step is to see how different AI systems handle increasing demands along each dimension. The authors test 15 large language models from three major families and look, for each scale, at the chance of success as tasks get harder. Fitting smooth curves through these points yields an “ability level” for every model on every scale: the demand level at which it succeeds about half the time when other demands are not higher. Unlike raw accuracy, these ability scores do not depend on the particular mix of easy and hard items in a benchmark. The resulting profiles show clear patterns: larger models mainly improve factual knowledge, while special “reasoning” models gain more in numerical and logical thinking, in identifying relevant information, and even in modelling other minds and social situations. The curves also reveal diminishing returns: simply adding more parameters eventually yields only modest ability gains.
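The curve-fitting step can be sketched as a logistic fit of success rate against demand level, with the ability score read off as the level where success crosses 50%. The toy success rates below are invented for illustration; the exact curve family the authors fit may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy data: fraction of items one model solves at each demand level.
levels = np.array([0, 1, 2, 3, 4, 5], dtype=float)
success = np.array([0.98, 0.95, 0.85, 0.55, 0.20, 0.05])

def logistic(d, ability, slope):
    # Success probability falls as demand d exceeds the ability level;
    # at d == ability the predicted success rate is exactly 0.5.
    return 1.0 / (1.0 + np.exp(slope * (d - ability)))

(ability, slope), _ = curve_fit(logistic, levels, success, p0=[3.0, 1.0])
print(round(float(ability), 2))  # demand level where success is about 50%
```

Because the fitted ability is a property of the curve, not of any one test, it stays comparable across benchmarks with different mixes of easy and hard items.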
Using demand profiles to forecast and control AI behavior (Figure 2)
Because both tasks and systems now live on the same set of scales, the authors can treat evaluation as a prediction problem. They train simple machine-learning “assessors” that take only the 18 demand levels for a question as input and output the probability that a particular AI will answer correctly. These assessors predict success very accurately, not just on familiar tasks but also on entirely new ones and on benchmarks left out of training. They outperform much heavier black‑box approaches that rely on text embeddings or fine‑tuning large models directly. This enables practical uses such as routing each incoming query to the model most likely to handle it safely, or rejecting queries that fall outside any model’s reliable zone before harm is done.
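A minimal version of such an assessor can be sketched with an off-the-shelf classifier: 18 demand levels in, success probability out, plus a rejection threshold for queries outside the model's reliable zone. The training data and the 0.7 cutoff below are synthetic assumptions for illustration; the paper's assessors are trained on real per-question outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_scales = 18

# Synthetic training set: demand-level vectors and whether a simulated
# model answered correctly (success gets rarer as total demand grows).
X = rng.integers(0, 6, size=(500, n_scales)).astype(float)
y = (X.sum(axis=1) + rng.normal(0, 5, size=500) < 40).astype(int)

assessor = LogisticRegression(max_iter=1000).fit(X, y)

# For an incoming query, predict success before running the model,
# and reject if the predicted probability falls below a safety cutoff.
query_demands = rng.integers(0, 6, size=(1, n_scales)).astype(float)
p_success = assessor.predict_proba(query_demands)[0, 1]

ABSTAIN_THRESHOLD = 0.7  # hypothetical safety cutoff
decision = "answer" if p_success >= ABSTAIN_THRESHOLD else "reject"
print(decision, round(float(p_success), 2))
```

With one assessor per candidate model, routing is then just picking the model with the highest predicted success probability for that query's demand profile.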
A step toward a science of AI evaluation
The authors conclude that general demand and ability scales can transform how we judge and deploy AI. Instead of chasing ever larger, short‑lived benchmarks and opaque aggregate scores, we can build a stable, extensible measurement framework that explains why systems fail, compares them fairly across domains, and anticipates their behavior on new tasks. Much like standardized units in physics made precise engineering possible, a shared, well‑designed set of cognitive scales could underpin safer and more predictable use of AI in the years ahead.
Citation: Zhou, L., Pacchiardi, L., Martínez-Plumed, F. et al. General scales unlock AI evaluation with explanatory and predictive power.
Nature 652, 58–67 (2026). https://doi.org/10.1038/s41586-026-10303-2
Keywords: AI evaluation, benchmarking, large language models, predictive assessment, AI safety