Clear Sky Science

Supervised learning for predicting unknown modifying variables in pliable lasso

2026-02-23

Why hidden influences matter for predictions

From spotting credit card fraud to forecasting disease risk, computers now make predictions that touch many parts of daily life. But real-world data are messy: the same input, like age or temperature, can matter differently depending on hidden context, such as gender, time period, or lab conditions. This paper explores how to handle such “invisible” influences when they are known for past data but missing for new cases, and shows how combining different machine learning tools can lead to predictions that are both accurate and easier to interpret.

Connecting inputs, hidden context, and outcomes

The study focuses on a powerful regression method called the pliable lasso. In simple terms, this method predicts an outcome (like blood pressure) from many features (such as age or protein levels) while also allowing a separate set of “modifying” variables to bend or reshape those relationships. For example, the effect of exercise on blood pressure might differ by gender. The pliable lasso is designed to capture these context-dependent effects while automatically keeping the model from becoming needlessly complicated. It does this by favoring simple patterns unless the data clearly support more complex interactions.
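
To make the "bending" concrete, here is a minimal sketch of the prediction rule usually written for the pliable lasso, in which each feature's slope is a main effect plus a shift that depends on the modifiers: y_hat = beta0 + z·theta0 + sum over j of x_j (beta_j + z·theta_j). The function name, the coefficient values, and the exercise example are illustrative assumptions, not values from the paper. Fitting the model, including the penalty that keeps each theta_j at zero unless the data support an interaction, is not shown.

```python
from typing import Sequence


def pliable_predict(
    x: Sequence[float],
    z: Sequence[float],
    beta0: float,
    theta0: Sequence[float],
    beta: Sequence[float],
    theta: Sequence[Sequence[float]],
) -> float:
    """Predict one outcome from features x and modifiers z.

    theta[j][k] is how much modifier k shifts the slope of feature j.
    """
    # Intercept plus the modifiers' own main effects.
    y_hat = beta0 + sum(t * zk for t, zk in zip(theta0, z))
    for j, xj in enumerate(x):
        # The slope of feature j is "bent" by the modifiers.
        slope_j = beta[j] + sum(t * zk for t, zk in zip(theta[j], z))
        y_hat += xj * slope_j
    return y_hat


# Hypothetical coefficients: one feature (weekly exercise hours) and
# one binary modifier (a 0/1 group indicator). Not from the paper.
coef = dict(beta0=130.0, theta0=[-5.0], beta=[-1.0], theta=[[-0.5]])

y_group0 = pliable_predict([4.0], [0.0], **coef)  # slope is -1.0
y_group1 = pliable_predict([4.0], [1.0], **coef)  # slope is -1.0 + (-0.5)
print(y_group0, y_group1)
```

With these made-up numbers, the same four hours of exercise lowers the predicted outcome more in group 1 than in group 0, which is the kind of context-dependent effect the method is built to capture.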

Three ways to handle missing context

The authors describe three common situations for these modifying variables. In the easiest “known-known” case, the modifiers are recorded for both training and future data, so the pliable lasso can be fitted once and directly applied. In a more challenging “known-unknown” case, the modifiers are available only in the training data and must be estimated for new observations. In the hardest “unknown-unknown” case, modifiers are never observed and must be approximated indirectly, for example by clustering similar individuals. This work zeroes in on the middle, practically important case: modifiers are known for old data, but must be predicted for new data before the pliable lasso can use them.
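
The "known-unknown" workflow can be sketched as follows: learn a mapping from features to the modifier on the training data, then apply it to new cases where the modifier is missing. This sketch uses a tiny hand-written k-nearest-neighbors vote, one of the method families the study compares, purely as a stand-in. The function name and all data values are made up for illustration.

```python
from collections import Counter
from math import dist


def knn_predict_modifier(train_x, train_z, new_x, k=3):
    """Estimate a categorical modifier for new_x by majority vote
    among its k nearest training rows (Euclidean distance)."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], new_x))[:k]
    votes = Counter(train_z[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: features are recorded together with the modifier.
train_x = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (3.0, 3.1), (3.2, 2.9), (2.9, 3.3)]
train_z = ["A", "A", "A", "B", "B", "B"]

# New observations: features only, the modifier is missing.
new_x = [(1.0, 1.0), (3.1, 3.0)]
z_hat = [knn_predict_modifier(train_x, train_z, x) for x in new_x]
print(z_hat)
```

The estimated labels in `z_hat` then take the place of the unobserved modifiers when the fitted pliable lasso is applied to the new cases.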

Testing many learning methods side by side

To estimate the missing modifiers, the authors systematically compare eight supervised learning algorithms: Random Forest, XGBoost, decision trees, support vector machines, k-nearest neighbors, artificial neural networks, Lasso, and Elastic Net. They evaluate two things for each: first, how well the method classifies the modifiers themselves; second, how well the overall pliable lasso pipeline predicts the final outcome once those estimated modifiers are plugged in. The tests span both simulated data and two real datasets: protein expression in mouse brains and material properties of superconductors. Cross-validation and hyperparameter tuning are used to avoid overly optimistic results and information leakage between training and test sets.
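
The two-part evaluation can be illustrated with a short sketch: one score for how well the modifiers were recovered, and one for how well the outcome is predicted after the estimated modifiers are plugged in. Everything here is an assumption made for illustration: `predict_outcome` is a hypothetical stand-in for an already fitted model with one feature and one binary modifier, the data are invented and noise-free, and cross-validation and tuning are left out.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of modifier labels that were estimated correctly."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def mean_squared_error(y_true, y_pred):
    """Average squared gap between observed and predicted outcomes."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def predict_outcome(x, z, beta0=1.0, theta0=0.5, beta=2.0, theta=1.0):
    """Hypothetical fitted model: the slope of x is beta + theta * z."""
    return beta0 + theta0 * z + x * (beta + theta * z)


# Made-up held-out data.
x_test = [1.0, 2.0, 3.0, 4.0]
z_true = [0, 1, 0, 1]   # modifiers that were actually present
z_hat = [0, 1, 1, 1]    # modifiers estimated by some classifier (one is wrong)
y_test = [predict_outcome(x, z) for x, z in zip(x_test, z_true)]

# Step 1: how well were the modifiers classified?
stage1_accuracy = accuracy(z_true, z_hat)

# Step 2: how good are the outcome predictions using the estimated modifiers?
y_pred = [predict_outcome(x, z) for x, z in zip(x_test, z_hat)]
stage2_mse = mean_squared_error(y_test, y_pred)

print(stage1_accuracy, stage2_mse)
```

In this toy setup a single misclassified modifier is the only source of outcome error, which shows why the two scores need to be reported separately: the first does not determine the second.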

What works best and why

The results reveal an interesting tension. Tree-based methods (XGBoost, Random Forest, and single decision trees) excel at classifying the hidden modifiers, often with near-perfect scores. Yet they do not always deliver the best final outcome predictions once their modifier estimates are fed into the pliable lasso. Instead, simpler, regularized linear models like Lasso and Elastic Net tend to produce the most accurate and stable outcome predictions, even when their modifier classification is slightly less accurate. The authors argue that this happens because tree-based methods can produce very sharp but occasionally wrong modifier labels that distort the delicate interaction structure in the pliable lasso, while regularized linear methods yield smoother, “softer” estimates that align better with the model’s assumptions.
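
A toy calculation, not taken from the paper, shows one way a soft estimate can beat a sharp label. Suppose the true binary modifier of a new case is 1 with probability 0.7 (an assumed number). Committing to the hard label 1 is usually right but costs a full unit of error when it is wrong, whereas passing the probability itself is never exactly right but never far off. Because the modifier multiplies a feature inside the model, an error in the modifier carries straight through to the predicted outcome.

```python
def expected_squared_error(z_estimate: float, p_one: float) -> float:
    """Expected (z - z_estimate)^2 when the true binary modifier z
    equals 1 with probability p_one and 0 otherwise."""
    return p_one * (1.0 - z_estimate) ** 2 + (1.0 - p_one) * (0.0 - z_estimate) ** 2


p_one = 0.7  # assumed chance that the true modifier is 1 for this case

hard_error = expected_squared_error(1.0, p_one)    # commit to the label 1
soft_error = expected_squared_error(p_one, p_one)  # pass the probability itself

print(hard_error, soft_error)
```

Under these assumptions the soft estimate has the smaller expected squared error. This is only an illustration of the general mechanism the authors describe, not a reproduction of their analysis.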

A practical take-home recipe

For practitioners who want strong, interpretable predictions in settings where important contextual factors are only partially observed, the study recommends a hybrid strategy. First, use powerful tree-based models to estimate the missing modifiers, taking advantage of their strength in finding complex patterns. Then, combine these estimated modifiers with the original features inside a pliable lasso model, ideally paired with Lasso or Elastic Net for the final regression step. This two-stage approach exploits the best of both worlds: flexible discovery of hidden structure, followed by a disciplined, transparent model for predicting outcomes.

Citation: Hawrami, Z.S.M., Cengiz, M.A. & Dünder, E. Supervised learning for predicting unknown modifying variables in pliable lasso. Sci Rep 16, 10200 (2026). https://doi.org/10.1038/s41598-026-36854-y

Keywords: pliable lasso, modifier variables, supervised learning, hybrid modeling, interaction effects