Clear Sky Science · en
Comparison of primary analysis strategies of randomized controlled trials with multiple endpoints with application to kidney transplantation
Why this matters for patients and trials
When doctors test new treatments, especially for people who have received a kidney transplant, they want to know not just whether patients live longer, but also whether they keep their new kidney and avoid serious side effects like infections. No single outcome tells the full story. This paper uses large-scale computer simulations to ask a practical question: when a trial follows several important outcomes at once, which statistical strategy best balances clear answers, fairness to patients, and the limited size of real-world studies?
Different ways to judge success
The authors focus on randomized controlled trials that track several key events after kidney transplantation: death, loss of the transplanted kidney, episodes of rejection, and serious infections. Rather than relying on just one of these outcomes, a trial can use one of three main strategies that are widely discussed in regulatory guidance. The first combines several events into a single “any bad event” outcome, known as a composite endpoint, so that the trial asks whether the new treatment delays or prevents the first such event. The second tests each event separately but tightens the significance threshold for each test, so that testing several outcomes does not inflate the overall chance of a false positive. The third, called generalized pairwise comparisons, ranks outcomes by clinical importance and compares patients in the two groups one pair at a time: first on the most important event, then on less critical ones only when the more important comparison is inconclusive.
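To make the third strategy concrete, here is a minimal Python sketch of a prioritized pairwise comparison. It is an illustration under simplifying assumptions, not the authors' implementation: every patient is assumed to be followed for the full five years (so censoring is ignored), a patient without an event is recorded at the five-year mark, and the function names `compare_pair` and `net_benefit` and the toy data are invented for this example.

```python
from itertools import product

def compare_pair(treated, control):
    """Return +1 if the treated patient wins, -1 if they lose, 0 if tied.

    Each patient is a tuple of event times ordered from the most to the
    least important outcome. A later event time counts as better.
    """
    for treated_time, control_time in zip(treated, control):
        if treated_time > control_time:
            return 1
        if treated_time < control_time:
            return -1
        # Equal times: this outcome is inconclusive, so fall through
        # to the next outcome in the priority order.
    return 0

def net_benefit(treated_arm, control_arm):
    """(wins - losses) / number of pairs, over all treated-control pairs."""
    scores = [compare_pair(t, c) for t, c in product(treated_arm, control_arm)]
    return (scores.count(1) - scores.count(-1)) / len(scores)

# Hypothetical event times in years: (death, graft loss, infection).
# 5.0 means "no event by the end of follow-up".
treated_arm = [(5.0, 5.0, 2.0), (5.0, 3.0, 1.0)]
control_arm = [(5.0, 5.0, 1.0), (4.0, 2.0, 0.5)]

print(net_benefit(treated_arm, control_arm))  # 0.5 (3 wins, 1 loss, 4 pairs)
```

In this toy data set, two of the four pairs are decided by death alone, one by graft loss, and only one reaches the infection comparison, which is how the priority order shapes the result.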
How the simulations were built
Because it is hard to work out exact formulas for how these strategies behave in complex settings, the researchers used clinical trial simulations. They generated thousands of “virtual trials” under a wide range of realistic scenarios: different sample sizes, different event rates for each outcome, varying sizes of treatment benefit or harm, and varying degrees of correlation between outcomes. Some scenarios reflected kidney transplant realities, where death and graft loss are rare but infections are common; others included a “terminal” event like death that prevents later outcomes from being seen, or allowed outcomes to be correlated without such blocking. In every simulated trial, they applied each analysis strategy and recorded whether it would have declared the treatment successful.
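A single virtual trial of this kind could be generated roughly as follows. This is a sketch under stated assumptions rather than the simulation model used in the paper: event times are drawn from independent exponential distributions, death is treated as the terminal event that stops follow-up for the other outcomes, and the hazard values, hazard ratios and the function names `simulate_patient` and `simulate_trial` are all hypothetical.

```python
import random

def simulate_patient(hazards, horizon, rng):
    """Return one (time, event_observed) pair per outcome.

    hazards[0] is treated as the terminal event (death): once it occurs,
    follow-up for the remaining outcomes stops at that time.
    """
    raw_times = [rng.expovariate(h) for h in hazards]
    follow_up_end = min(raw_times[0], horizon)
    outcomes = [(follow_up_end, raw_times[0] < horizon)]
    for t in raw_times[1:]:
        # A later outcome is only seen if it happens before follow-up ends.
        outcomes.append((min(t, follow_up_end), t < follow_up_end))
    return outcomes

def simulate_trial(n_per_arm, control_hazards, hazard_ratios, horizon, seed):
    """Simulate both arms; treated hazards = control hazards * hazard ratios."""
    rng = random.Random(seed)
    treated_hazards = [h * r for h, r in zip(control_hazards, hazard_ratios)]
    return {
        "control": [simulate_patient(control_hazards, horizon, rng)
                    for _ in range(n_per_arm)],
        "treated": [simulate_patient(treated_hazards, horizon, rng)
                    for _ in range(n_per_arm)],
    }

# Hypothetical yearly hazards for (death, graft loss, infection):
# rare death and graft loss, common infection.
trial = simulate_trial(
    n_per_arm=100,
    control_hazards=[0.02, 0.03, 0.30],
    hazard_ratios=[0.8, 0.8, 0.6],
    horizon=5.0,
    seed=7,
)
infections = sum(observed for _, _, (_, observed) in trial["control"])
print(f"control patients with an observed infection: {infections}")
```

Repeating this with many different seeds, and applying each analysis strategy to every generated data set, gives the kind of “virtual trial” collection described above.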

What they found about overall power
Across most scenarios with time-to-event outcomes, the strategies that combine information into a single global test (the composite endpoint and generalized pairwise comparisons) were more powerful than the multiple-testing approach. That means they were more likely to detect a true treatment benefit when one existed, especially when the treatment helped across several outcomes. Generalized pairwise comparisons were often slightly more powerful than the composite, particularly when benefits were present on all prioritized outcomes. However, the performance of generalized pairwise comparisons depended strongly on which event was placed highest in the priority order and how often that event occurred. By contrast, multiple testing with correction tended to be less sensitive, but its performance improved as trials became larger and when some low-frequency but highly important events still showed a clear treatment effect.
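Power in this sense is simply the share of simulated trials in which a strategy declares success. The sketch below estimates it for two of the strategies under deliberately simplified assumptions that differ from the paper: each outcome is reduced to a yes/no event by the end of follow-up instead of a time-to-event, outcomes are independent, a one-sided two-proportion z-test stands in for the survival tests, and the multiple-testing correction is plain Bonferroni. All probabilities and function names are hypothetical.

```python
import math
import random

def z_test_p_value(events_treated, events_control, n):
    """One-sided p-value that the treated arm has fewer events (equal arms)."""
    pooled = (events_treated + events_control) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation at all, nothing to test
    std_error = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (events_control - events_treated) / n / std_error
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal

def draw_arm(n, probs, rng):
    """One list of True/False event flags per patient, one flag per outcome."""
    return [[rng.random() < p for p in probs] for _ in range(n)]

def estimate_power(n, control_probs, treated_probs,
                   n_sims=2000, alpha=0.025, seed=1):
    """Share of simulated trials in which each strategy declares success."""
    rng = random.Random(seed)
    k = len(control_probs)
    hits = {"composite": 0, "bonferroni": 0}
    for _ in range(n_sims):
        control = draw_arm(n, control_probs, rng)
        treated = draw_arm(n, treated_probs, rng)
        # Composite: a patient counts once if any component event occurred.
        p_composite = z_test_p_value(
            sum(any(flags) for flags in treated),
            sum(any(flags) for flags in control), n)
        hits["composite"] += p_composite < alpha
        # Separate tests: success if any outcome passes the stricter
        # Bonferroni threshold alpha / k.
        p_each = [z_test_p_value(sum(flags[j] for flags in treated),
                                 sum(flags[j] for flags in control), n)
                  for j in range(k)]
        hits["bonferroni"] += min(p_each) < alpha / k
    return {name: count / n_sims for name, count in hits.items()}

# Hypothetical event probabilities for (death, graft loss, infection).
print(estimate_power(200, [0.05, 0.10, 0.40], [0.035, 0.07, 0.28]))
```

Setting the treated probabilities equal to the control probabilities turns the same function into a check of the false-positive rate, which should stay near or below `alpha` for both strategies.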
Hidden trade-offs and tricky situations
The simulations also revealed important caveats. When a frequent but less severe outcome, such as infection, dominates the combined measure, the composite endpoint can show a statistically significant benefit even if there is little or no improvement—and in extreme cases, even some worsening—in rare but more serious outcomes like death or graft loss. Generalized pairwise comparisons partly address this by giving higher weight to the most serious events, but they can lose power if that top-priority event is common yet unaffected by treatment, because many patient comparisons stop at that level and never consider beneficial changes in lower-priority outcomes. Multiple testing, while less powerful overall, offers clearer insight into which specific outcome drives a positive or negative result, at the cost of needing stronger effects or larger samples to reach significance after adjustment.
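The masking effect of a frequent outcome can be seen with simple arithmetic. The numbers below are invented for illustration and do not come from the paper, and the calculation assumes the three events occur independently of one another.

```python
def composite_risk(component_risks):
    """Probability of at least one event, assuming independent components."""
    no_event = 1.0
    for risk in component_risks:
        no_event *= 1.0 - risk
    return 1.0 - no_event

# Hypothetical risks over follow-up: infection improves, death worsens.
control = {"death": 0.03, "graft_loss": 0.05, "infection": 0.40}
treated = {"death": 0.05, "graft_loss": 0.05, "infection": 0.25}

control_composite = composite_risk(control.values())
treated_composite = composite_risk(treated.values())

print(f"composite risk, control: {control_composite:.3f}")  # 0.447
print(f"composite risk, treated: {treated_composite:.3f}")  # 0.323
```

Here the composite risk falls from about 45% to about 32%, driven entirely by fewer infections, even though the risk of death rose from 3% to 5%. A trial analysed only on the composite would look like a clear success.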

Influence of correlations and opposing effects
The behavior of all three strategies changed when outcomes were correlated—such as when patients who lose their graft are also more likely to die—or when treatment had opposite effects on different outcomes. Strong positive correlations often reduced power for composite endpoints and generalized pairwise comparisons, because highly linked components carry less independent information than loosely connected ones. In scenarios with opposing effects, the global methods—especially when they emphasized more important events—were less likely to declare success if harm appeared in top-priority outcomes, even when lower-priority outcomes improved. Still, they often remained more powerful than the adjusted multiple-testing approach, provided the main “driving” outcome benefited from treatment.
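One rough way to see why linked components add less information is to compare the composite risk when two outcomes are independent with the risk when they always strike the same patients. The sketch below uses a simple mixing model chosen for illustration only: `linkage` is the assumed share of patients whose two outcomes are fully tied together, which is not the same thing as a correlation coefficient and is not the dependence model used in the paper. The risks are hypothetical.

```python
def composite_risk_two(p1, p2, linkage):
    """Risk of at least one of two events under a simple mixing model.

    linkage = 0: the events are independent.
    linkage = 1: the events always hit the same patients, so the rarer
                 event adds nothing to the composite.
    """
    independent = 1.0 - (1.0 - p1) * (1.0 - p2)
    fully_linked = max(p1, p2)
    return linkage * fully_linked + (1.0 - linkage) * independent

# Hypothetical risks: treatment halves both outcomes (0.20 -> 0.10).
for linkage in (0.0, 0.5, 1.0):
    control = composite_risk_two(0.20, 0.20, linkage)
    treated = composite_risk_two(0.10, 0.10, linkage)
    print(f"linkage {linkage:.1f}: control {control:.3f}, "
          f"treated {treated:.3f}, difference {control - treated:.3f}")
```

With these numbers the gap between the groups shrinks from 0.17 when the outcomes are independent to 0.10 when they are fully linked, and there are fewer composite events overall. Both changes make a real benefit harder to detect with the same number of patients.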
What this means for future trials
For readers outside statistics, the main message is that there is no one-size-fits-all way to judge complex treatments. Combining outcomes into a single measure or using pairwise comparisons can make trials smaller and more efficient, helping to detect real benefits in kidney transplantation and similar settings. But these approaches can also hide which specific outcomes improved or worsened, and may be strongly influenced by how outcomes are prioritized or correlated. The authors conclude that trial designers should balance statistical efficiency with clarity: global tests can be used for the main decision, but should always be accompanied by a careful, outcome-by-outcome look to ensure that apparent benefits are not masking important harms.
Citation: Herkner, F., Posch, M., Bond, G. et al. Comparison of primary analysis strategies of randomized controlled trials with multiple endpoints with application to kidney transplantation. Sci Rep 16, 8769 (2026). https://doi.org/10.1038/s41598-026-38979-6
Keywords: kidney transplantation trials, composite endpoints, multiple endpoints analysis, generalized pairwise comparisons, clinical trial simulation