Clear Sky Science
Feature reduction using swarm optimization and random forest classifiers for early diabetes risk prediction
Why catching diabetes early matters
Type 2 diabetes often creeps in quietly, damaging the heart, eyes, kidneys, and nerves long before it is diagnosed. Doctors usually rely on many questions and tests to assess someone’s risk, which can be time‑consuming for both patients and clinics. This study explores how smart computer programs can flag early diabetes risk using just a handful of simple yes‑or‑no questions, potentially making screening faster, cheaper, and easier to deploy in busy or low‑resource settings. 
A smarter checklist for diabetes risk
The researchers worked with a real‑world dataset from a diabetes hospital in Sylhet, Bangladesh. Each of the 520 people in the dataset was labeled as either having early‑stage diabetes or not. For every person, doctors had recorded age, gender, and 14 straightforward clinical signs and symptoms, such as frequent urination (polyuria), unusual thirst (polydipsia), sudden weight loss, itching, blurred vision, and obesity. Apart from age, these entries were simple yes‑or‑no answers to a questionnaire, making the data similar to what a nurse or health worker could gather in minutes during a routine visit.
Teaching the computer to focus on what matters most
Instead of feeding all 16 pieces of information into a model by default, the team asked a key question: which of these features actually carry the most information about diabetes risk? To answer it, they combined a popular machine‑learning method called a random forest with three "swarm" search strategies inspired by animal behavior: a fox optimizer, a honey badger algorithm, and tuna swarm optimization. These swarms behave like digital hunters, roaming through many possible combinations of features and model settings to find those that give the best predictions with the fewest inputs. The system repeatedly split the data into training and testing portions, tuned its internal settings, and voted on which features and parameter values worked best across many runs.
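The search described above can be pictured with a small sketch. It is not the authors' method. It replaces the fox, honey badger, and tuna optimizers with a generic population search, replaces the random forest with a simple lookup classifier, and runs on synthetic yes‑or‑no data in which only two of eight features matter. All function names, the 0.99 weighting, and the data are assumptions made for illustration.

```python
import random

def make_data(n=300, n_features=8, seed=0):
    """Synthetic yes/no records; only features 0 and 1 determine the label."""
    rng = random.Random(seed)
    X = [[int(rng.random() < 0.5) for _ in range(n_features)] for _ in range(n)]
    y = [1 if row[0] or row[1] else 0 for row in X]
    return X, y

def holdout_error(mask, X_train, y_train, X_test, y_test):
    """Error of a stand-in lookup classifier restricted to the masked features."""
    cols = [i for i, keep in enumerate(mask) if keep]
    if not cols:
        return 1.0  # an empty feature set is treated as the worst case
    votes = {}
    for row, label in zip(X_train, y_train):
        key = tuple(row[i] for i in cols)
        votes.setdefault(key, [0, 0])[label] += 1
    default = int(sum(y_train) * 2 >= len(y_train))  # majority-class fallback
    wrong = 0
    for row, label in zip(X_test, y_test):
        counts = votes.get(tuple(row[i] for i in cols))
        pred = default if counts is None else int(counts[1] >= counts[0])
        wrong += pred != label
    return wrong / len(y_test)

def fitness(mask, data, alpha=0.99):
    """Lower is better: mostly error, plus a small penalty per kept feature."""
    return alpha * holdout_error(mask, *data) + (1 - alpha) * sum(mask) / len(mask)

def swarm_select(data, n_features, n_agents=12, n_iter=30, seed=1):
    """Generic population search over feature masks (not FOX, HBA, or TSO)."""
    rng = random.Random(seed)
    swarm = [[1] * n_features]  # start one agent at "use everything"
    swarm += [[int(rng.random() < 0.5) for _ in range(n_features)]
              for _ in range(n_agents - 1)]
    best = min(swarm, key=lambda m: fitness(m, data))
    for _ in range(n_iter):
        for k, agent in enumerate(swarm):
            # Move toward the best-known mask, with occasional random flips.
            moved = [b if rng.random() < 0.5 else a for a, b in zip(agent, best)]
            moved = [1 - bit if rng.random() < 0.1 else bit for bit in moved]
            if fitness(moved, data) <= fitness(agent, data):
                swarm[k] = moved
        best = min(swarm + [best], key=lambda m: fitness(m, data))
    return best, fitness(best, data)

X, y = make_data()
data = (X[:200], y[:200], X[200:], y[200:])
best_mask, best_fit = swarm_select(data, n_features=8)
print("selected features:", [i for i, keep in enumerate(best_mask) if keep])
print("fitness:", round(best_fit, 4))
```

The fitness function carries the main idea: a candidate is rewarded for low error first and for using fewer features second, so the search drifts toward short checklists that still predict well.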
How well the streamlined models performed
The resulting three models, named FOX_RF, HBA_RF, and TSO_RF, were all highly accurate. In a single train‑and‑test run on the full dataset, the tuna‑based model (TSO_RF) classified every person correctly, reaching 100% accuracy, precision, and recall. Under the more demanding 10‑fold cross‑validation, in which the data are split into ten parts and each part takes a turn as the held‑out test set, TSO_RF still achieved an average accuracy above 98%, slightly better than the other two models and better than previously published techniques on the same dataset. Importantly, the honey‑badger‑based model reached solid performance while using only 10 of the 16 features, and the other models needed just 13 or 14. That reduction means fewer questions for patients and lighter computation for any future app or device.
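For readers curious how such scores are produced, here is a minimal sketch of 10‑fold cross‑validation together with accuracy, precision, and recall. The data are synthetic (520 records with a single noisy yes‑or‑no feature), and the "model" is a deliberately trivial stand‑in, not a random forest. The helper names are hypothetical.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle record indices and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def scores(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average accuracy."""
    accs = []
    for held_out in kfold_indices(len(y), k, seed):
        test = set(held_out)
        train = [i for i in range(len(y)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        accs.append(scores([y[i] for i in held_out], preds)["accuracy"])
    return sum(accs) / k

def fit_majority(X_train, y_train):
    """Stand-in model: majority label for each value of the first feature."""
    counts = {0: [0, 0], 1: [0, 0]}
    for row, label in zip(X_train, y_train):
        counts[row[0]][label] += 1
    return {value: int(c[1] >= c[0]) for value, c in counts.items()}

def predict_majority(model, row):
    return model[row[0]]

# Synthetic data: the label copies the symptom, with roughly 10% of labels flipped.
rng = random.Random(42)
X = [[int(rng.random() < 0.5)] for _ in range(520)]
y = [row[0] if rng.random() > 0.1 else 1 - row[0] for row in X]

mean_acc = cross_validate(X, y, fit_majority, predict_majority)
print("mean 10-fold accuracy:", round(mean_acc, 3))
```

Because every record is held out exactly once, the averaged score reflects performance on data the model did not see during training, which is why it is usually a little lower than a single pass.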
Peeking inside the black box
Modern prediction systems often work well but are hard to interpret. To address this, the researchers used an explainable‑AI method called SHAP to measure how much each feature nudged the model toward predicting diabetes or not for each individual. Across all three models, the same pattern emerged: frequent urination, excessive thirst, and gender consistently had the strongest influence on predictions, with sudden weight loss, muscle stiffness, irritability, and a few other signs playing supporting roles. The team also examined specific mistakes—cases where the models misclassified people—and showed that small changes in these key symptoms often flipped the decision, revealing where the models are most sensitive and where clinicians should be cautious.
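The idea behind SHAP can be shown on a toy example. The sketch below computes exact Shapley values by brute force for a made‑up three‑symptom risk score. The weights are invented for illustration, "absent" features are simply reset to a baseline answer, and nothing here reproduces the SHAP library or the study's models.

```python
from itertools import combinations
from math import factorial

FEATURES = ["polyuria", "polydipsia", "itching"]

def toy_risk(x):
    """Made-up risk score; the weights are illustrative, not fitted to data."""
    return (0.05 + 0.5 * x["polyuria"] + 0.3 * x["polydipsia"]
            + 0.1 * x["polyuria"] * x["polydipsia"])

def shapley_values(model, person, baseline):
    """Average each feature's marginal effect over all orders of adding features."""
    n = len(FEATURES)

    def value(subset):
        # Features in `subset` take the person's answers; the rest stay at baseline.
        mixed = {f: (person[f] if f in subset else baseline[f]) for f in FEATURES}
        return model(mixed)

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi

person = {"polyuria": 1, "polydipsia": 1, "itching": 1}
baseline = {"polyuria": 0, "polydipsia": 0, "itching": 0}
phi = shapley_values(toy_risk, person, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The contributions add up to the gap between this person's score and the baseline score, which is the property that lets SHAP split a single prediction into per‑symptom "nudges". In this toy score, itching has no effect at all and so receives a contribution of zero.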
What this means for everyday health care
In plain terms, the study shows that a carefully designed computer model can identify early diabetes risk very accurately using a short, symptom‑based checklist and a few demographic details. By trimming away less useful questions and highlighting the most telling signs—especially frequent urination, excessive thirst, and gender—the approach could underpin quick screening tools in clinics, community health programs, or even smartphone‑based systems. While the work still needs testing on larger and more diverse populations, it points toward a future where early diabetes warnings are both more precise and less burdensome for patients.
Citation: Sarker, P., Nahid, AA., Choi, K. et al. Feature reduction using swarm optimization and random forest classifiers for early diabetes risk prediction. Sci Rep 16, 14355 (2026). https://doi.org/10.1038/s41598-026-35984-7
Keywords: diabetes prediction, machine learning, feature selection, swarm optimization, early diagnosis