Clear Sky Science
Uncertainty aware and explainable construction cost prediction using a hybrid probabilistic learning model
Why the Price of a Building Matters Before It’s Built
Before a shovel hits the ground, owners and builders must guess how much a project will really cost. If that guess is wrong, budgets are blown, schedules slip, and trust between partners erodes. This paper introduces a new way to forecast construction costs that not only aims to be accurate, but also tells you how confident the forecast is and why the model thinks a project will be expensive or cheap. That combination of accuracy, honesty about uncertainty, and clear explanations is what makes this work important for anyone interested in how data and artificial intelligence can make big projects less risky.

The Trouble With Traditional Cost Guessing
For decades, construction costs have been estimated using expert judgment and simple statistics. These methods draw heavily on past projects and human experience. They are easy to understand, but they struggle when projects become more complex, when markets are volatile, or when many factors interact in tangled ways. Classic statistical tools such as linear regression assume straight-line relationships between inputs and costs, even though real projects are affected by jumps in material prices, design choices, site conditions, and changing labor markets. Many newer machine‑learning models can capture these messy patterns, but they often act like black boxes and typically produce a single best guess, without indicating how wrong that guess might be.
A New Hybrid Model for Smarter Forecasts
The authors propose a hybrid learning system called NGBoost‑ETR that tries to tackle three problems at once: accuracy, uncertainty, and transparency. At its core is a technology known as Natural Gradient Boosting (NGBoost), which is designed to output not just a cost estimate, but an entire probability curve around that estimate. Instead of saying, “the slab will cost 17 dollars per square foot,” the model effectively says, “17 is the most likely value, but here is how much higher or lower it could realistically be.” To make NGBoost more powerful on construction data, the authors replace its usual simple trees with a stronger tree‑based learner called Extra Trees Regression, which is especially good at capturing non‑linear relationships between inputs such as slab type, floor area, loads, and material unit prices.
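
To make the idea of a probability curve concrete, the short sketch below assumes the model's output for one slab is a normal (bell-shaped) curve centered on 17 dollars per square foot. The spread of 1.5 and the 20-dollar budget threshold are made-up illustrative numbers, not values from the paper, and the code uses only Python's standard library rather than NGBoost itself.

```python
from statistics import NormalDist

# Hypothetical predictive distribution for one slab, in dollars per square foot.
# mu is the "most likely value" from the text; sigma is an assumed spread.
predicted = NormalDist(mu=17.0, sigma=1.5)

# Central 95% range: the band the cost is expected to fall in 19 times out of 20.
lower = predicted.inv_cdf(0.025)
upper = predicted.inv_cdf(0.975)

# Chance that the real cost exceeds an assumed budget ceiling of 20 dollars.
prob_over_budget = 1.0 - predicted.cdf(20.0)

print(f"95% range: {lower:.2f} to {upper:.2f} dollars per square foot")
print(f"Chance of exceeding 20 dollars: {prob_over_budget:.1%}")
```

A single-number forecast would report only the 17. The distribution also answers questions such as how much contingency to hold, which is the practical reason for predicting a full curve.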
How the Model Was Tested in the Real World
To see if the approach works in practice, the researchers trained and tested their model on 4,477 real entries from RSMeans, a widely used U.S. cost database for building components. Each record describes a structural floor assembly—such as one‑way slabs, flat slabs, or waffle slabs—along with its area, expected loads, and the unit cost of concrete and formwork. The model’s performance was compared with 10 popular machine‑learning methods and 9 other NGBoost hybrids. Standard accuracy measures showed that NGBoost‑ETR produced some of the best point predictions, with very small average errors on unseen data. Just as important, the team evaluated how well the model’s predicted ranges matched reality, using a suite of six uncertainty metrics that judge both how often the actual cost falls inside the predicted band and how narrow that band is.
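
The two questions behind those uncertainty metrics, how often the actual cost lands inside the predicted band and how wide the band is, can be sketched in a few lines. The function names and the four sample costs below are hypothetical, chosen only for illustration; the summary does not list the paper's six metrics, so this should be read as a generic example of coverage and width rather than the authors' exact definitions.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual costs that fall inside their predicted band."""
    hits = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return hits / len(actual)


def mean_interval_width(lower, upper):
    """Average width of the predicted bands (narrower is more useful)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Made-up actual costs for four floor assemblies, in dollars per square foot.
actual = [16.8, 17.4, 21.0, 15.2]

# A "tight" model: narrow bands that miss one of the four projects.
tight_lower = [15.5, 16.0, 17.5, 14.0]
tight_upper = [18.5, 19.0, 20.5, 17.0]

# A "cautious" model: bands so wide that they always contain the true cost.
wide_lower = [7.0, 7.0, 7.0, 7.0]
wide_upper = [27.0, 27.0, 27.0, 27.0]

tight_cov = interval_coverage(actual, tight_lower, tight_upper)
tight_width = mean_interval_width(tight_lower, tight_upper)
wide_cov = interval_coverage(actual, wide_lower, wide_upper)
wide_width = mean_interval_width(wide_lower, wide_upper)

print(f"tight model: coverage {tight_cov:.2f}, mean width {tight_width:.1f}")
print(f"wide model:  coverage {wide_cov:.2f}, mean width {wide_width:.1f}")
```

In this toy data the cautious model covers every project but only by quoting a 20-dollar band, while the tight model covers three of four with a 3-dollar band. Judging a forecast on both numbers at once is what prevents a model from looking good simply by being vague.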

Seeing Which Factors Drive Cost
Because decision‑makers are reluctant to trust a black box, the authors weave explainability into their design using SHAP, a modern technique that assigns each input a contribution score for every prediction. This allows users to see, for example, how much high formwork prices, a particular slab type, or a large tributary area pushed a forecast up or down for an individual project. Across the dataset, formwork cost emerges as the single most influential driver of total cost, followed by slab type and area. By tying these explanations directly to the model’s central cost estimate, practitioners can scrutinize whether the predictions align with their domain knowledge and adjust designs or negotiations accordingly.
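
SHAP is built on Shapley values, a way of splitting a prediction fairly among its inputs. The sketch below computes exact Shapley values by brute force for a tiny made-up cost formula with three inputs. The formula, the feature names, and the baseline project are all assumptions for illustration; this is neither the SHAP library nor the paper's model, and real SHAP tools average over background data instead of using a single reference project.

```python
from itertools import combinations
from math import factorial


def toy_cost(f):
    # Hypothetical cost formula (NOT the paper's model), dollars per square foot.
    return (1.2 * f["formwork_price"] + 2.0) * f["slab_factor"] + 0.001 * f["area"]


def shapley_values(model, instance, baseline):
    """Exact Shapley contribution of each feature, relative to a baseline."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the project's values; the rest stay at baseline.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    contributions = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        contributions[name] = total
    return contributions


# Assumed reference project and the project being explained.
baseline = {"formwork_price": 5.0, "slab_factor": 1.0, "area": 1000.0}
project = {"formwork_price": 8.0, "slab_factor": 1.3, "area": 2500.0}

contributions = shapley_values(toy_cost, project, baseline)
for name, amount in contributions.items():
    print(f"{name}: {amount:+.2f}")
```

The contributions always add up to the gap between the project's predicted cost and the baseline's predicted cost, which is what lets a reader check that nothing in the explanation has been left unaccounted for.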
What This Means for Future Projects
Overall, the NGBoost‑ETR framework delivers highly accurate cost forecasts, relatively tight yet well‑behaved uncertainty ranges, and clear insight into which variables matter most. Some competing models offered slightly higher coverage of the true costs, but only by predicting unrealistically wide ranges that are of little practical use. The new model instead strikes a balance: its predicted ranges may miss the true cost slightly more often than those of an extremely conservative method, but they are compact enough to inform real‑world budgeting, bidding, and risk planning. For lay readers, the key takeaway is that this research moves construction cost prediction from educated guesswork toward a more honest, data‑driven “weather forecast” for project budgets, one that tells you not just what is likely to happen but how sure we can be.
Citation: Chen, L., Khalid, O.W., Tiang, JJ. et al. Uncertainty aware and explainable construction cost prediction using a hybrid probabilistic learning model. Sci Rep 16, 10973 (2026). https://doi.org/10.1038/s41598-026-44904-8
Keywords: construction cost forecasting, probabilistic machine learning, project risk management, explainable AI, infrastructure planning