Clear Sky Science

VALORIS: One-shot and lossless vertical logistic regression for privacy-protecting multi-site health analytics

2026-03-08

Why sharing health data is so hard

Modern medicine increasingly depends on combining information from many sources: hospital records, lab results, images, and even genetic data. Yet these pieces of information usually sit in different organizations that are not allowed—legally or ethically—to pool their detailed patient records in one place. This makes it difficult to run the kinds of statistical analyses that help doctors predict who is at risk of serious outcomes like kidney failure or death in intensive care. The study introduces VALORIS, a new way to perform a popular type of analysis across multiple sites while keeping every patient’s raw data safely at home.

Many pieces of one patient’s story

To understand the challenge, imagine a child with chronic kidney disease whose story is split across systems. One hospital database holds age, sex, and measures of kidney function. Another stores blood test results. A third system might track long-term outcomes such as kidney failure. Each site holds different columns of information about the same children, a situation called a “vertical” split. None of these organizations wants to reveal its detailed records, and some are not even allowed to reveal the outcome (such as whether kidney failure occurred) outside their walls. Still, researchers would like to build a single predictive model that uses all of this scattered information as if it were in one place.

A one-shot way to learn from many sites

VALORIS tackles this problem for logistic regression, a workhorse method used to study how multiple factors together relate to a yes–no outcome, such as failure of an organ or death in hospital. Instead of shipping patient-level data around, each site performs a compact local calculation on its own data, summarizing patterns of how variables vary together. These summaries, which look like mathematical matrices, are sent once to a special role called the response node, where the outcome is stored. The response node combines the summaries, runs a single optimization step, and then sends carefully crafted intermediate numbers back to each site. Using only these shared quantities, every site can reconstruct the exact regression results for its own variables—without ever seeing another site’s raw records or the full outcome list.
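For readers who want to see the flow of messages, here is a minimal Python sketch of one way a "summaries in, coefficients out" scheme can work. It is an illustration under stated assumptions, not the paper's algorithm. It assumes a ridge-penalized logistic regression, outcomes coded as -1 or +1, and the standard fact that the fitted coefficients can be written as the data matrix transposed times a vector `c` with one entry per patient. All names (`X_sites`, `grams`, `c`, `lam`) and the simulated data are hypothetical. The sketch covers only the "lossless" part of the story: in this naive version the sign of each entry of `c` exposes that patient's outcome, so it does not reproduce the privacy protections described in the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 60  # same patients at every site

# Hypothetical vertical split: three sites hold different columns.
X_sites = [rng.normal(size=(n, 2)), rng.normal(size=(n, 2)), rng.normal(size=(n, 1))]
X_pooled = np.hstack(X_sites)  # built here only to check the result
true_beta = np.array([0.8, -0.5, 0.3, 0.0, -0.4])  # simulation assumption
# Outcome coded as -1/+1, held by the response node only.
y = np.where(rng.random(n) < sigmoid(X_pooled @ true_beta), 1.0, -1.0)
lam = 1.0  # ridge penalty weight (assumption)

# Step 1: each site sends one n-by-n summary (a Gram matrix), not its records.
grams = [Xk @ Xk.T for Xk in X_sites]

# Step 2: the response node adds the summaries and solves for c by Newton's
# method on the stationarity condition lam * c - y * sigmoid(-y * K c) = 0.
K = sum(grams)
c = np.zeros(n)
for _ in range(25):
    s = sigmoid(-y * (K @ c))
    g = lam * c - y * s
    J = lam * np.eye(n) + (s * (1 - s))[:, None] * K
    c -= np.linalg.solve(J, g)

# Step 3: each site turns c into coefficients for its own columns.
beta_sites = [Xk.T @ c for Xk in X_sites]
beta_split = np.concatenate(beta_sites)

# Reference: the same penalized model fitted on pooled data.
beta_pooled = np.zeros(X_pooled.shape[1])
for _ in range(25):
    s = sigmoid(-y * (X_pooled @ beta_pooled))
    grad = lam * beta_pooled - X_pooled.T @ (y * s)
    H = lam * np.eye(X_pooled.shape[1]) + X_pooled.T @ ((s * (1 - s))[:, None] * X_pooled)
    beta_pooled -= np.linalg.solve(H, grad)

print("largest coefficient difference:", np.max(np.abs(beta_split - beta_pooled)))
```

The two fits agree because the sum of the per-site Gram matrices equals the Gram matrix of the pooled table, so the response node solves the same optimization problem the pooled analysis would.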

As accurate as putting all the data in one place

Whenever we replace a standard analysis with a privacy-protecting version, a key worry is: do we lose accuracy? The authors show that VALORIS can be tuned so that its answers are, for all practical purposes, identical to the answers we would get from a traditional pooled analysis. They do this by solving a slightly modified version of the usual logistic regression problem that includes tiny penalty terms. Mathematical arguments and numerical experiments show that when these penalties are chosen small enough, the resulting estimates and their margins of error become indistinguishable from the gold-standard centralized solution, while still being computable from split data.
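The effect of shrinking a penalty can be checked numerically. The sketch below is an illustration only: it assumes a ridge-type penalty (the article does not specify the exact form the authors use), simulated data, and a hypothetical helper `fit_logistic`. It fits the same logistic regression with progressively smaller penalty weights and measures how far each estimate sits from the unpenalized one. It looks at point estimates only, not margins of error.

```python
import numpy as np

def fit_logistic(X, y, lam, iters=50):
    """Newton's method for logistic regression with penalty (lam / 2) * ||beta||^2."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # fitted probabilities
        grad = X.T @ (p - y) + lam * beta
        H = X.T @ ((p * (1 - p))[:, None] * X) + lam * np.eye(X.shape[1])
        beta -= np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_beta = np.array([1.0, -0.5, 0.25])  # simulation assumption
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)  # 0/1 outcome

beta_unpenalized = fit_logistic(X, y, lam=0.0)  # ordinary logistic regression

lambdas = [1.0, 1e-2, 1e-4, 1e-6]
gaps = [float(np.max(np.abs(fit_logistic(X, y, lam) - beta_unpenalized)))
        for lam in lambdas]

for lam, gap in zip(lambdas, gaps):
    print(f"penalty {lam:g}: largest coefficient gap {gap:.2e}")
```

The gap printed for each penalty weight is the largest absolute difference between any penalized coefficient and its unpenalized counterpart.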

Real-world tests on kidney disease and intensive care

To show that the method works beyond theory, the team applied VALORIS to two real health studies. The first focused on children with chronic kidney disease treated at Necker-Enfants Malades Hospital in Paris. Here, one node held basic characteristics and the outcome of kidney failure within two years, while another held blood test results. VALORIS produced estimates of how each factor related to kidney failure that matched the standard combined-data analysis to within less than one ten-thousandth on average. The second test used a much larger dataset, the MIMIC-IV intensive care database, split into three nodes representing emergency, hospital ward, and intensive care information. Again, VALORIS reproduced centralized results almost exactly, even with over ten thousand patients and many variables.

Building in privacy, not just promising it

Many so-called “privacy-preserving” methods simply avoid sending raw records, yet still leak enough information for a determined partner to reconstruct individuals’ data. The authors therefore introduce a stronger requirement: after all messages have been exchanged, no party should be able to uniquely recover any person’s data from what they see. They analyze, step by step, what each site receives during VALORIS and prove that, under realistic conditions—such as having at least one continuous numerical variable at a site outside any potential attacker—there are always many different underlying datasets that could have produced the same shared numbers. They also provide a practical check, based on optimization, that the response node can run before sending anything out to confirm that this stronger level of protection is met for a given project.
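The idea that "many different underlying datasets could have produced the same shared numbers" can be made concrete with a small Python sketch. It is an intuition aid, not the authors' proof or their optimization-based check. It assumes the shared summary is a Gram matrix (the data table multiplied by its own transpose) and shows that rotating the columns of a hypothetical table `X` with an orthogonal matrix `Q` yields a different table with exactly the same summary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical site data: 6 patients, 3 continuous variables.
X = rng.normal(size=(6, 3))

# A random orthogonal matrix (Q @ Q.T is the identity), obtained from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# A different candidate dataset: the columns of X mixed by the rotation Q.
X_alt = X @ Q

# The summary a recipient would see in this sketch.
K = X @ X.T
K_alt = X_alt @ X_alt.T

same_summary = np.allclose(K, K_alt)  # the two summaries coincide
same_data = np.allclose(X, X_alt)     # the underlying tables do not

print("summaries match:", same_summary)
print("datasets match:", same_data)
```

Rotated columns generally take arbitrary real values, which gives some intuition for why continuous variables help: a table made only of yes/no columns leaves far fewer alternative tables that still look plausible. The paper's actual condition and argument are more specific than this sketch.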

What this means for future health studies

In plain terms, VALORIS shows that hospitals and research networks do not always need to choose between strong privacy and high-quality results. For logistic regression, they can keep their detailed records behind their own firewalls, exchange only limited summaries in a single communication round, and still recover results that are effectively identical to those from a traditional pooled analysis. This makes it easier for busy clinical partners to participate, reduces approval hurdles around data sharing, and opens the door to large-scale studies that combine clinical, laboratory, and other data sources. The authors suggest that similar ideas could be extended to other models and to settings with missing data, helping future health research respect patient confidentiality while still gaining the statistical power that comes from working together.

Citation: Camirand Lemyre, F., Domingue, MP., Morissette, JP. et al. VALORIS: One-shot and lossless vertical logistic regression for privacy-protecting multi-site health analytics. Sci Rep 16, 12558 (2026). https://doi.org/10.1038/s41598-026-41936-y

Keywords: privacy-preserving health analytics, distributed logistic regression, multi-site medical data, federated statistical modeling, electronic health records