Clear Sky Science
Importance of balanced datasets with feature selection and ensemble methods on heart disease classification using distinctive machine learning techniques: a comparative analysis
Why this matters for everyday hearts
Heart disease is still the world’s top killer, yet most people who take health surveys or visit clinics never see their answers turned into early warnings. This study asks a simple but powerful question: if we clean and rebalance large health datasets, carefully choose the most telling risk factors, and then pick the right type of computer model, can we do a noticeably better job of spotting who is likely to develop heart problems?

Turning messy health data into something useful
The researchers worked with a large public dataset from the U.S. Behavioral Risk Factor Surveillance System, which contains self-reported information from thousands of adults about their health and habits. Each person is described by 17 everyday features such as age, smoking and drinking status, sleep time, physical activity, diabetes, kidney disease, and overall self-rated health, along with whether they have heart disease. Like most real-world medical records, the data were messy: some values were missing, some people were clear outliers, and far fewer people reported heart disease than not. The team first cleaned the data, filled in missing values, removed extreme outliers, and then split the records into separate groups for training and testing the computer models.
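The pipeline described above (fill missing values, drop implausible outliers, split into training and test sets) can be sketched as a toy example. The records, field names, and the 80/20 split ratio here are invented for illustration, not the paper's actual columns or settings.

```python
import random

# Hypothetical toy records mimicking BRFSS-style rows; None marks a missing value.
records = [
    {"age": 45, "sleep_time": 7, "heart_disease": 0},
    {"age": 60, "sleep_time": None, "heart_disease": 1},
    {"age": 52, "sleep_time": 6, "heart_disease": 0},
    {"age": 38, "sleep_time": 48, "heart_disease": 0},  # implausible outlier
]

# 1) Fill missing sleep_time with the median of observed values.
observed = sorted(r["sleep_time"] for r in records if r["sleep_time"] is not None)
median = observed[len(observed) // 2]
for r in records:
    if r["sleep_time"] is None:
        r["sleep_time"] = median

# 2) Drop obvious outliers (sleep time must fit in a 0-24 hour day).
records = [r for r in records if 0 <= r["sleep_time"] <= 24]

# 3) Shuffle and split into training and test sets (80/20 is an assumption;
#    the summary does not state the paper's actual split ratio).
random.seed(0)
random.shuffle(records)
cut = int(0.8 * len(records))
train, test = records[:cut], records[cut:]
```

Real pipelines typically do this with pandas and scikit-learn, but the steps are the same: impute, filter, then split before any model sees the test data.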
Fixing the problem of rare cases
One big obstacle was imbalance: people without heart disease greatly outnumbered those with it. In such situations, a model can appear accurate simply by guessing “no disease” most of the time, while missing many true cases. To counter this, the authors used a technique called oversampling, which creates realistic synthetic examples of the rarer “heart disease” cases so that the training data contain roughly equal numbers of positive and negative outcomes. This balancing step improved the ability of several models to find people with heart disease, but on its own it did not make the predictions reliably sharp or discriminating.
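The idea of synthesizing new minority-class examples can be sketched as follows. This is a SMOTE-style interpolation written from scratch on made-up points; the summary only says "oversampling", so the exact variant the authors used is an assumption.

```python
import random

random.seed(1)

# Hypothetical imbalanced training set: (feature_vector, label) pairs.
majority = [([float(x), x + 1.0], 0) for x in range(20)]          # "no disease"
minority = [([5.0, 2.0], 1), ([6.0, 3.0], 1), ([7.0, 2.5], 1)]    # "disease"

# SMOTE-style sketch: each synthetic minority point lies part-way along the
# line between two real minority points, so it looks like a plausible case.
synthetic = []
while len(minority) + len(synthetic) < len(majority):
    (a, _), (b, _) = random.sample(minority, 2)
    t = random.random()  # interpolation factor in [0, 1)
    synthetic.append(([a[i] + t * (b[i] - a[i]) for i in range(len(a))], 1))

balanced = majority + minority + synthetic
```

After this step the two classes contribute equally to training, so a model can no longer look accurate by always guessing "no disease".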

Picking the most telling risk factors
The study then asked which pieces of information about a person matter most for prediction. The authors tested three families of statistical tools that score each feature by how strongly it relates to heart disease. They evaluated each family alone and in eight combinations formed by taking unions and intersections of their selections, essentially asking, “What if we keep every feature any method flags?” versus “What if we keep only the features all methods agree on?” Age brackets, self-rated general health, difficulty walking, history of stroke, diabetes, kidney disease, body mass index, and certain lifestyle markers repeatedly emerged as the most informative signals across methods.
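The union-versus-intersection idea is just set algebra over each method's flagged features. The three example selections below are invented for the sketch; the paper's actual scoring methods and chosen features differ.

```python
# Illustrative top features flagged by three hypothetical scoring methods.
anova_f     = {"age", "general_health", "stroke", "diabetes", "difficulty_walking"}
chi2        = {"age", "general_health", "kidney_disease", "diabetes", "smoking"}
mutual_info = {"age", "general_health", "diabetes", "bmi", "stroke"}

# "Keep everything any method flags" -> union of the three sets.
union_features = anova_f | chi2 | mutual_info

# "Keep only what all methods agree on" -> intersection of the three sets.
consensus_features = anova_f & chi2 & mutual_info
```

The union keeps a broad, forgiving feature set; the intersection keeps a small, high-confidence core. The paper's eight combinations are different mixes along this spectrum.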
Putting machine learning models head-to-head
With balanced data and carefully chosen features, the team compared seven popular machine learning approaches: logistic regression, decision trees, random forests, naïve Bayes, support vector machines, artificial neural networks, and k-nearest neighbors. They judged them using common measures: overall accuracy, how often positive predictions were correct (precision), how many true heart disease cases were caught (recall), and how well models separated diseased from non-diseased people across all thresholds (the ROC–AUC score). Random forests and decision trees consistently rose to the top once feature selection was applied, especially when ANOVA-based methods were part of the selection process. In the best setting, a random forest reached about 92% accuracy, 93% recall, and an AUC of 0.92, clearly ahead of its competitors.
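The four yardsticks mentioned above are easy to compute by hand from a model's predictions. The labels and scores below are toy values made up for the sketch, not results from the study.

```python
# Toy ground truth, hard predictions, and model confidence scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed cases
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy  = (tp + tn) / len(y_true)  # all correct / all cases
precision = tp / (tp + fp)           # correct positives / predicted positives
recall    = tp / (tp + fn)           # caught cases / actual cases

# ROC-AUC: the chance a random diseased person scores higher than a
# random healthy one (ties count half).
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
```

In practice libraries like scikit-learn provide these metrics directly, but the definitions above are what those functions compute.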
When combining models helps—and when it doesn’t
The authors also explored “bagging,” a way of creating many slightly different versions of a model and then combining their votes. This ensemble trick is often used to reduce instability in models like decision trees. In this study, bagging brought small gains for a few high-variance models but did not dramatically improve their ability to distinguish heart disease from healthy cases, especially when used without the careful feature selection described above. In fact, relying on bagging alone sometimes left important positive cases undetected, which would be unacceptable in a medical setting.
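Bagging can be sketched with a deliberately tiny model: a one-dimensional threshold "stump" standing in for the decision trees the summary mentions. The data and the choice of 25 bootstrap rounds are assumptions for illustration only.

```python
import random

random.seed(2)

# Hypothetical training points: (feature, label), with the true boundary at 10.
data = [(x, int(x > 10)) for x in range(21)]

def fit_stump(sample):
    """Pick the threshold that best separates one bootstrap sample."""
    best_thr, best_acc = 0, -1.0
    for thr in range(21):
        acc = sum((x > thr) == bool(y) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

# Bagging: train many stumps on bootstrap resamples of the training data...
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]  # draw with replacement
    stumps.append(fit_stump(sample))

# ...then combine them by majority vote at prediction time.
def predict(x):
    votes = sum(x > thr for thr in stumps)
    return int(votes > len(stumps) / 2)
```

Each resample trains a slightly different stump, and averaging their votes smooths out the instability of any single one, which is exactly why bagging helps high-variance models like decision trees more than stable ones.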
What this means for patients and doctors
To a layperson, the key message is that how we prepare and shape the data can matter more than how fancy the prediction model is. Simply throwing a complex algorithm at uneven, noisy health records is not enough. This work shows that balancing the data and carefully choosing a focused set of meaningful risk factors—especially those highlighted by ANOVA-based methods—allows relatively straightforward models like random forests and decision trees to make much more reliable heart disease predictions. While these results still need to be confirmed on other populations and in real clinics, they point toward practical recipes for building early-warning tools that may one day help doctors spot at-risk patients earlier and tailor prevention efforts more effectively.
Citation: Ara, J., Bhuiyan, H., Roza, I.I. et al. Importance of balanced datasets with feature selection and ensemble methods on heart disease classification using distinctive machine learning techniques: a comparative analysis. Sci Rep 16, 11706 (2026). https://doi.org/10.1038/s41598-026-47691-4
Keywords: heart disease prediction, machine learning, feature selection, health data balancing, random forest models