Clear Sky Science

Efficient detection of intrusions in TON-IoT dataset using hybrid feature selection approach

2026-02-07

Why protecting smart devices matters

Billions of everyday gadgets—from home cameras to factory sensors—now talk to each other over the internet, forming what we call the Internet of Things (IoT). While this connectivity brings convenience and efficiency, it also opens new doors for hackers. The paper summarized here tackles a simple but crucial question: how can we reliably spot attacks in these sprawling device networks without needing heavy, power-hungry security software?

The challenge of spotting digital break-ins

To study attacks on IoT systems, researchers often rely on large public datasets that record what network traffic looks like during both normal operation and cyberattacks. One of the most widely used is the ToN-IoT dataset, which captures traffic from a realistic industrial testbed and covers many kinds of attacks, such as denial of service, ransomware, password cracking, and man-in-the-middle spying. However, the authors show that this dataset has a hidden pitfall: many attacks were launched from fixed ranges of IP addresses and port numbers. That means a model can “cheat” by learning who the attacker is instead of what malicious behavior looks like. Such models can score very high in the lab but fail badly when an attacker comes from a new address.
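The “cheating” problem can be made concrete with a deliberately bad model. The sketch below is not the authors' code: it is a toy classifier that memorizes the majority label for each source IP, run on a handful of invented flows. The field names (`src_ip`, `src_bytes`, `label`) and all addresses are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def fit_ip_memorizer(rows):
    """Learn the majority label seen for each source IP (a deliberately biased model)."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["src_ip"]][row["label"]] += 1
    return {ip: counts.most_common(1)[0][0] for ip, counts in votes.items()}

def predict(model, row, default="normal"):
    """Unknown addresses fall back to 'normal', because the model never looked at behavior."""
    return model.get(row["src_ip"], default)

def accuracy(model, rows):
    hits = sum(predict(model, row) == row["label"] for row in rows)
    return hits / len(rows)

# Invented toy flows: the attacker always uses the same address during training.
train = [
    {"src_ip": "192.168.1.30", "src_bytes": 90_000, "label": "attack"},
    {"src_ip": "192.168.1.30", "src_bytes": 85_000, "label": "attack"},
    {"src_ip": "192.168.1.10", "src_bytes": 400, "label": "normal"},
    {"src_ip": "192.168.1.11", "src_bytes": 350, "label": "normal"},
]
# Held-out traffic from the same lab setup: same attacker address.
same_lab = [
    {"src_ip": "192.168.1.30", "src_bytes": 88_000, "label": "attack"},
    {"src_ip": "192.168.1.10", "src_bytes": 420, "label": "normal"},
]
# Identical behavior, but the attacker now comes from an address never seen before.
new_attacker = [
    {"src_ip": "10.0.0.99", "src_bytes": 88_000, "label": "attack"},
    {"src_ip": "192.168.1.10", "src_bytes": 420, "label": "normal"},
]

model = fit_ip_memorizer(train)
lab_accuracy = accuracy(model, same_lab)        # perfect inside the lab
field_accuracy = accuracy(model, new_attacker)  # the unseen attacker is missed
```

The byte counts of the two attack flows are the same in both test sets. Only the address changed, and that alone is enough to break a model that learned identity rather than behavior.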

From bulky data to a lean view of behavior

The original ToN-IoT network data includes 44 different measurements for every connection, ranging from IP information to details of web and encrypted traffic. Handling all of them increases computing time and memory needs, which is a problem for small IoT gateways and edge devices. The authors first use their understanding of how attacks work to strip out features that are either biased (such as IP addresses and port numbers) or not very helpful for distinguishing attacks. They argue that most IoT threats ultimately show up as strange patterns in how many packets and bytes are sent and received and in how long connections last, regardless of who is talking to whom. This first stage shrinks the feature set from 44 down to seven core traffic statistics related to volume and duration.
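A minimal sketch of this first, knowledge-driven stage might look like the following. The column names are assumptions modeled on the public ToN-IoT network schema, and the particular seven statistics are an illustrative guess: this summary says only that seven volume and duration features survive, not which ones.

```python
# Assumed column names; the summary does not list them.
IDENTITY_COLUMNS = {"src_ip", "src_port", "dst_ip", "dst_port"}

# Illustrative guess at the seven volume/duration statistics kept after stage one.
CORE_COLUMNS = [
    "duration",
    "src_bytes", "dst_bytes",
    "src_pkts", "dst_pkts",
    "src_ip_bytes", "dst_ip_bytes",
]
LABEL_COLUMNS = ["label", "type"]

def to_behavior_view(record):
    """Keep only the traffic statistics and labels; drop everything else."""
    keep = CORE_COLUMNS + LABEL_COLUMNS
    return {name: record[name] for name in keep if name in record}

# One invented connection record with a few of the original fields.
raw = {
    "src_ip": "192.168.1.30", "src_port": 4444,
    "dst_ip": "192.168.1.190", "dst_port": 80,
    "proto": "tcp", "http_method": "GET",
    "duration": 0.42,
    "src_bytes": 1200, "dst_bytes": 300,
    "src_pkts": 10, "dst_pkts": 6,
    "src_ip_bytes": 1720, "dst_ip_bytes": 620,
    "label": 1, "type": "ddos",
}

view = to_behavior_view(raw)
leaked = IDENTITY_COLUMNS & set(view)  # identity fields that survived the filter
```

Using an explicit allow-list rather than a deny-list means any column not named here is dropped by default, so an overlooked identifier cannot slip through.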

Hybrid feature selection: three lenses on the same data

Next, the team applies three different “wrapper” methods that repeatedly train a model while adding, removing, or recombining features to see which subset matters most. Forward selection builds up from an empty set, keeping a feature only if it boosts accuracy. Backward elimination starts from all seven and removes features whose removal does not hurt accuracy. A genetic algorithm explores many combinations in parallel, evolving better subsets over generations. All three are tested using a simple decision tree classifier, with accuracy as the yardstick. By intersecting the results, the authors arrive at a stable core of five features: connection duration, bytes sent, bytes received, and the IP-level byte counts in each direction. These five variables effectively capture abnormal surges or imbalances in traffic that signal many different attack types.
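The two greedy wrappers and the final intersection can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the `score` argument stands for "train a model on this subset and return its accuracy", and here a made-up lookup table (`TOY_GAIN`) replaces the decision tree so the example runs without any dependencies. The genetic algorithm is omitted for brevity, and the feature names are assumed.

```python
def drop(items, name):
    """Return a copy of items without name."""
    return [item for item in items if item != name]

def forward_selection(features, score):
    """Grow from an empty set, adding the best candidate only while the score rises."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining feature boosts accuracy
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

def backward_elimination(features, score):
    """Start from all features, dropping one whenever the score does not fall."""
    selected = list(features)
    best = score(selected)
    while len(selected) > 1:
        victim = max(selected, key=lambda f: score(drop(selected, f)))
        victim_score = score(drop(selected, victim))
        if victim_score < best:
            break  # every remaining feature is needed
        selected = drop(selected, victim)
        best = victim_score
    return selected

# Hypothetical stand-in for decision-tree accuracy (in percent): a 50% baseline
# plus an invented gain per feature. Real use would train and validate a model.
TOY_GAIN = {"duration": 20, "src_bytes": 15, "dst_bytes": 0, "src_pkts": -1}

def toy_score(subset):
    return 50 + sum(TOY_GAIN[f] for f in subset)

features = list(TOY_GAIN)
fwd = forward_selection(features, toy_score)
bwd = backward_elimination(features, toy_score)
stable_core = [f for f in fwd if f in bwd]  # features both methods agree on
```

Because both functions take the scorer as a parameter, swapping in a cross-validated decision tree would leave the selection logic unchanged. Intersecting the outputs keeps only features that survive regardless of the search direction.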

Lightweight models that still perform strongly

With this slimmed-down, behavior-focused dataset, the researchers evaluate how well straightforward machine learning models can tell safe traffic from attacks. Using only the five chosen features, a decision tree reaches 98.6% accuracy for basic “attack vs normal” classification and 97.2% accuracy when distinguishing among multiple attack categories. A k-nearest neighbor model performs similarly, and more complex ensemble methods like random forests or gradient boosting offer only tiny gains while demanding more computation and memory. Crucially, the authors confirm through statistical tests that their chosen features are genuinely informative, rather than artifacts of the way the data was collected. They do note that subtle man-in-the-middle attacks—designed to blend in with normal flows—remain harder to detect, hinting that future work may need richer protocol or timing cues for these cases.
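One way to see how a high overall score can coexist with a weak spot such as man-in-the-middle traffic is to look at recall per class alongside accuracy. The sketch below uses ten invented labels and predictions purely for illustration; they are not results from the paper, and the helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples that were labeled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def to_binary(labels):
    """Collapse every attack category into a single 'attack' class."""
    return ["normal" if label == "normal" else "attack" for label in labels]

# Invented toy labels: two MITM flows pass as normal, one DDoS flow is mislabeled as MITM.
y_true = ["normal"] * 4 + ["ddos"] * 3 + ["mitm"] * 3
y_pred = ["normal"] * 4 + ["ddos", "ddos", "mitm"] + ["normal", "normal", "mitm"]

multi_acc = accuracy(y_true, y_pred)
binary_acc = accuracy(to_binary(y_true), to_binary(y_pred))
recall = per_class_recall(y_true, y_pred)
```

In this toy case the binary score is higher than the multi-class score, because confusing one attack type with another still counts as catching an attack. The per-class view shows that the stealthy category is the one dragging the numbers down.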

What this means for real-world security

For non-specialists, the key takeaway is that you do not always need massive models or dozens of technical measurements to protect IoT systems. By stripping out clues that only work in one lab setup, and focusing instead on a handful of traffic behaviors, the authors show that simple, fast algorithms can still catch most attacks with high reliability. Their five-feature version of the ToN-IoT dataset is easier to process on constrained devices at the edge of the network, making it practical for routers, gateways, and small hubs that must react to threats in real time. In short, the study suggests a path toward more trustworthy and deployable intrusion detection for the everyday smart devices that increasingly surround us.

Citation: Dharini, N., Janani, V.S. & Katiravan, J. Efficient detection of intrusions in TON-IoT dataset using hybrid feature selection approach. Sci Rep 16, 7763 (2026). https://doi.org/10.1038/s41598-026-37834-y

Keywords: IoT security, intrusion detection, machine learning, feature selection, network traffic