Clear Sky Science · en
Hierarchical malware detection, family identification, and variant attribution using CNN-based hybrid models on grayscale executable images
Why this matters for everyday computer users
Malicious software no longer arrives as a few easily recognized viruses. Today, attackers rapidly churn out countless look‑alike programs that slip past traditional antivirus tools. This study shows that by turning programs into simple black‑and‑white pictures and reading them with modern image‑recognition networks, a computer can not only spot malware with near‑perfect reliability, but also sort it into families and even specific strains. That level of detail helps defenders understand what an attack is trying to do, where it came from, and how to stop it.
From program bytes to gray pictures
The authors focus on Windows executable files, the kind of programs that commonly spread malware on laptops, desktops, and servers. Instead of dissecting each file by hand or running it in a controlled lab, they read its raw bytes straight through and map each byte to one pixel, so that the byte's value (0 to 255) sets the pixel's shade of gray. The result is a 224×224 grayscale picture whose textures and blocks reflect structure inside the file: code regions, padding, encrypted payloads, and more. Every file in their dataset is treated this way, whether it is harmless software or one of 33 distinct malware variants spanning five broad families such as ransomware and spyware.
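The byte‑to‑pixel mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `bytes_to_grayscale`, the roughly square layout, the zero padding, and the nearest‑neighbour resampling to 224×224 are all assumptions, since this summary does not say how files of different sizes are brought to a common image size.

```python
import math


def bytes_to_grayscale(data: bytes, size: int = 224) -> list[list[int]]:
    """Map raw bytes to a size x size grid of pixel values in 0..255.

    Hypothetical helper: the layout and resampling choices are assumptions.
    """
    if not data:
        raise ValueError("empty input")

    # Lay the bytes out row by row in the smallest square that holds them,
    # padding the unused tail with zero bytes (black pixels).
    side = math.isqrt(len(data) - 1) + 1  # ceil(sqrt(len(data)))
    padded = data + bytes(side * side - len(data))

    # Nearest-neighbour resampling from side x side to size x size.
    image = []
    for r in range(size):
        src_r = r * side // size
        row = [padded[src_r * side + (c * side // size)] for c in range(size)]
        image.append(row)
    return image


# Demo on synthetic bytes; a real file would be read with open(path, "rb").
demo = bytes_to_grayscale(bytes(range(256)) * 4)
print(len(demo), len(demo[0]))
```

Each pixel is simply a byte value, so long runs of padding show up as flat dark regions, while compressed or encrypted sections look like noise.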
One model, three answers at once
On top of these images, the team builds a deep‑learning system that works like an experienced customs officer. With a single glance at an incoming picture, it answers three questions at once: Is this file benign or malicious? If malicious, which broad family does it belong to? And which specific variant best describes it? The core of the system is a convolutional network, the same kind of architecture used for everyday image recognition. This shared backbone learns general visual features from the grayscale pictures. Above it sit three parallel output branches that specialize in the three decision levels, so the system can learn how coarse and fine‑grained patterns relate to each other instead of treating each task separately.
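The idea of one shared representation feeding three parallel output branches can be sketched as follows. This is a toy stand‑in in plain Python, not the paper's network: the convolutional backbone is replaced by a ready‑made feature vector, the heads are single linear layers with random weights, and the class name `ThreeHeadClassifier`, the feature size of 16, and the unweighted sum of cross‑entropy losses are all assumptions. The head sizes (2, 5, and 33) follow the counts given in the text; how the paper's family and variant heads treat benign files is not stated here.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """One dense layer: a weighted sum of the features per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_head(n_out, n_in, rng):
    """Random small weights for one output branch (untrained, illustrative)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class ThreeHeadClassifier:
    """Three output branches reading the same shared feature vector."""

    def __init__(self, n_features=16, n_families=5, n_variants=33, seed=0):
        rng = random.Random(seed)
        self.heads = {
            "binary": make_head(2, n_features, rng),
            "family": make_head(n_families, n_features, rng),
            "variant": make_head(n_variants, n_features, rng),
        }

    def predict(self, features):
        # `features` stands in for the output of the shared CNN backbone.
        return {name: softmax(linear(w, b, features))
                for name, (w, b) in self.heads.items()}


def joint_loss(probs, targets):
    """Sum of the three cross-entropy losses (assumed equal weighting)."""
    return sum(-math.log(probs[name][targets[name]]) for name in probs)


model = ThreeHeadClassifier()
out = model.predict([0.5] * 16)
print({name: len(p) for name, p in out.items()})
```

Because a single loss drives all three branches, errors at any level would feed back into the same shared features during training, which is one way coarse and fine‑grained decisions can inform each other.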
Three ways to read hidden structure
To probe what design works best, the authors test three “hybrid” versions of the model. In one, a temporal convolution head treats the flattened image like a sequence and uses dilated filters to connect distant regions, capturing long‑range patterns scattered across the file. A second version adds a capsule‑based head that keeps track of how small parts combine into larger structures, aiming to distinguish closely related variants that share many components. The third version uses a bidirectional sequence layer that reads the image both left‑to‑right and right‑to‑left, mimicking how context on either side of a region can change its meaning. All three are trained on exactly the same balanced dataset, with equal representation of each malware variant and of benign files, to ensure that performance differences reflect architecture rather than data quirks. 
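To see why dilated filters help connect distant regions, the sketch below implements a one‑dimensional dilated convolution in plain Python and computes how many input positions a stack of such layers can see. It illustrates the general technique only: the kernel size of 3 and the dilations 1, 2, 4, 8 are illustrative assumptions, not the settings used in the paper.

```python
def dilated_conv1d(seq, kernel, dilation):
    """Valid-mode 1D convolution whose taps sit `dilation` steps apart."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(k * seq[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(seq) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Input positions seen by one output after stacking the given layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# With dilation 2, each output combines positions i, i + 2, and i + 4.
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1], dilation=2))
# prints [9, 12, 15, 18]

# Four stacked layers with doubling dilation (illustrative values).
print(receptive_field(3, [1, 2, 4, 8]))
# prints 31
```

Doubling the dilation at each layer makes the receptive field grow roughly exponentially with depth, which matters when a flattened 224×224 image is a sequence of 50,176 positions.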
How well does it work?
Across more than 3,000 held‑out test images, all three hybrids perform well. For the simplest question, “malicious or not?”, two of the three reach 100% accuracy, and the third misclassifies only a handful of benign files as malicious, erring on the side of caution. When asked to name the broader family, accuracy remains very high at 97–98%, with only occasional confusion between behaviorally similar groups such as spyware and trojans. The toughest test is to name the exact variant among 33 options. Even here, all three models reach 93–94% accuracy using nothing but grayscale images, and detailed score breakdowns show that most variants are recognized with very high reliability. One design, pairing the convolutional backbone with temporal convolutions, offers the most balanced performance across all variants.
What this means for digital investigations
For security teams and forensic analysts, these results are more than an academic benchmark. In a real incident, thousands of suspicious programs might be collected from infected machines. Running full behavioral analysis on each one is slow and resource‑intensive. The proposed image‑based system can quickly filter out harmless files, group the rest by family, and pinpoint likely variants in a single pass, all without executing them. That makes it a powerful triage tool: investigators can focus their most expensive tools on the most important samples while still gaining campaign‑level insight. The study demonstrates that simple gray pictures of program bytes, processed with carefully chosen neural‑network designs, are enough to support fine‑grained malware attribution that used to require far more elaborate and time‑consuming analysis.
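The triage step described above (drop harmless files, then group the rest by family and variant) can be sketched as a small grouping routine. Everything here is hypothetical: the function name `triage`, the dictionary keys, the file names, and the variant labels are placeholders for whatever a real classifier and evidence store would provide.

```python
from collections import defaultdict


def triage(predictions):
    """Group malicious samples by predicted family, then by variant.

    Files predicted benign are dropped from the result.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for p in predictions:
        if p["malicious"]:
            groups[p["family"]][p["variant"]].append(p["file"])
    return {family: dict(variants) for family, variants in groups.items()}


# Hypothetical model outputs for four collected files.
sample = [
    {"file": "a.exe", "malicious": True, "family": "ransomware", "variant": "variant_a"},
    {"file": "b.exe", "malicious": False, "family": None, "variant": None},
    {"file": "c.exe", "malicious": True, "family": "ransomware", "variant": "variant_a"},
    {"file": "d.exe", "malicious": True, "family": "spyware", "variant": "variant_b"},
]
print(triage(sample))
```

An analyst could then send one representative per variant group to slower behavioral analysis instead of examining every file.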
Citation: Saxena, M., Das, T. Hierarchical malware detection, family identification, and variant attribution using CNN-based hybrid models on grayscale executable images. Sci Rep 16, 9948 (2026). https://doi.org/10.1038/s41598-026-40655-8
Keywords: malware detection, deep learning, grayscale images, CNN hybrid models, digital forensics