Clear Sky Science

BERT-spaCy hybrid NLP and blockchain-enhanced adaptive CTI for IOC extraction and threat prediction


Why smarter cyber defenses matter

Everyday life now depends on digital systems—from hospital records and online banking to smart homes and industrial robots. Yet cyber-attacks are growing faster and more sophisticated than many defenses can handle. This paper presents an advanced but practical approach to cyber threat intelligence that aims to spot attacks earlier, learn from new incidents automatically, and let organizations share warning signs safely without fear of tampering.

Figure 1.

Turning messy clues into clear warning signs

Modern attacks leave scattered traces across emails, security logs, social media posts, and technical reports. These traces, known as indicators of compromise (IOCs), include suspicious web addresses, IP addresses, malware names, and file fingerprints (hashes). The authors build a hybrid text-analysis engine that combines three techniques: hand-crafted patterns for highly structured items, a fast language toolkit (spaCy) for general text handling, and a powerful deep-learning model (BERT) to understand context. Working together, these tools pull useful threat clues out of unstructured writing with around 95% accuracy, even when the language is noisy or informal.
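To make the "hand-crafted patterns" part concrete, here is a minimal sketch of pattern-based extraction for highly structured indicators. It is an illustration only, not the authors' code: the pattern set, the `IOC_PATTERNS` name, and the `extract_iocs` function are assumptions, and a real pipeline would pass the remaining text on to spaCy and BERT for context-dependent items such as malware names.

```python
import re

# Hypothetical pattern set for illustration; real IOC grammars are broader
# (defanged URLs, IPv6, domain names, other hash types, and so on).
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "url": re.compile(r"\bhttps?://[^\s<>\"']+", re.IGNORECASE),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def extract_iocs(text):
    """Return a dict mapping IOC type to a sorted list of unique matches."""
    found = {}
    for kind, pattern in IOC_PATTERNS.items():
        # Strip sentence punctuation that a greedy URL match may swallow.
        matches = sorted({m.rstrip(".,;:)") for m in pattern.findall(text)})
        if matches:
            found[kind] = matches
    return found


report = (
    "Beacon to 203.0.113.7 observed; payload from "
    "http://evil.example/payload.exe, hash d41d8cd98f00b204e9800998ecf8427e."
)
iocs = extract_iocs(report)
```

Regular expressions suit items with a rigid shape, which is why the summary reserves them for structured indicators and leaves free-form mentions to the language models.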

Teaching machines to recognize and adapt to attacks

Extracting clues is not enough; the system must also decide whether an event is likely benign or dangerous. To do this, the framework uses an ensemble of machine-learning models, including BERT, a recurrent network (LSTM), and a simpler probabilistic method. Each model brings a different strength (deep context, sequence understanding, or robustness on small samples), and their opinions are combined in a confidence-weighted vote. The system is designed to keep learning: when new labeled examples arrive, it updates its internal parameters without starting from scratch. Over a year of simulated operation, this adaptive approach lifts detection accuracy from 75% to 93% and cuts false alarms, especially in imbalanced data where genuine attacks are rare.
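A confidence-weighted vote can be sketched in a few lines. The summary does not say how each model's confidence is computed or what decision threshold is used, so the `weighted_vote` function, the 0.5 threshold, and the example numbers below are all assumptions made for illustration.

```python
def weighted_vote(predictions, threshold=0.5):
    """Combine per-model outputs into one verdict.

    predictions: list of (p_malicious, confidence) pairs, one per model,
    where both values lie in [0, 1]. Returns (label, score), with score
    being the confidence-weighted mean of the malicious probabilities.
    """
    total_confidence = sum(conf for _, conf in predictions)
    if total_confidence == 0:
        raise ValueError("at least one model must report non-zero confidence")
    score = sum(p * conf for p, conf in predictions) / total_confidence
    label = "malicious" if score >= threshold else "benign"
    return label, score


# Made-up outputs standing in for BERT, the LSTM, and the probabilistic model.
label, score = weighted_vote([(0.9, 0.8), (0.6, 0.5), (0.2, 0.2)])
```

In this example the two more confident models outweigh the third, so the combined score is about 0.71 and the event is flagged as malicious.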

Locking in trust with an unchangeable record

A persistent problem in cyber defense is trust: organizations may hesitate to share threat information if they fear it could be altered, misused, or challenged later. To address this, the framework adds a lightweight, blockchain-inspired ledger. Each processed report—its extracted clues, system verdict, and time of observation—is sealed into a cryptographic block linked to the previous one, creating an audit trail that is extremely hard to rewrite silently. In tests, deliberate tampering in the chain is reliably detected. Because the design is streamlined and runs on a single node, it adds only a few milliseconds per entry, keeping the system fast enough for busy security operations centers.
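The linking idea can be illustrated with a minimal single-node hash chain built on Python's standard `hashlib`. This is a sketch under stated assumptions, not the authors' ledger: the field names (`iocs`, `verdict`, `observed_at`), the all-zero genesis hash, and the choice of SHA-256 are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


def block_hash(body):
    """Hash a block body using a canonical (sorted-key) JSON encoding."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, iocs, verdict, observed_at):
    """Seal one processed report into a block linked to the previous one."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {
        "index": len(chain),
        "iocs": iocs,
        "verdict": verdict,
        "observed_at": observed_at,
        "prev_hash": prev_hash,
    }
    chain.append({**body, "hash": block_hash(body)})


def verify_chain(chain):
    """Return True only if every block's hash and back-link are intact."""
    prev_hash = GENESIS_HASH
    for index, block in enumerate(chain):
        body = {key: value for key, value in block.items() if key != "hash"}
        if (
            block["index"] != index
            or block["prev_hash"] != prev_hash
            or block["hash"] != block_hash(body)
        ):
            return False
        prev_hash = block["hash"]
    return True
```

Editing any stored field changes that block's recomputed hash, so `verify_chain` returns False after tampering. Note that a chain held on one node is tamper-evident rather than tamper-proof: someone who controls the node could rebuild every hash, which is why deployments often publish or replicate the latest hash elsewhere.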

Figure 2.

Testing reliability across different digital worlds

Cyber defenses often perform well on one dataset but stumble when the environment changes. The authors therefore test their system on two widely used collections of network traffic, which differ in attack types and patterns. They introduce a “cross-dataset robustness index” to measure how consistently a model performs when moved between datasets. The BERT-based component scores almost perfectly on this scale, slightly outperforming LSTM and clearly beating more traditional methods. Detailed statistical checks, including extensive simulations and effect-size analysis, show that these gains are unlikely to be due to chance and remain stable under noisy, uneven conditions.
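This summary does not give the formula for the cross-dataset robustness index, so the following is purely a hypothetical stand-in that conveys the idea of a consistency score: the ratio of a model's worst to best score across datasets, which equals 1.0 when performance is identical everywhere. The `robustness_index` name and the formula are assumptions, not the paper's definition.

```python
def robustness_index(scores):
    """Hypothetical consistency score in [0, 1]: worst score / best score.

    scores: one non-negative performance value (for example, accuracy)
    per dataset for the same model. This is NOT the paper's formula.
    """
    if len(scores) < 2:
        raise ValueError("need scores from at least two datasets")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    best = max(scores)
    if best == 0:
        return 0.0
    return min(scores) / best
```

Under this stand-in, a model scoring 0.8 on one dataset and 0.6 on another would receive an index of 0.75, while a model that scores the same on both would receive 1.0.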

What this means for everyday security

Put simply, this work shows how to turn scattered human-written reports and raw network traces into a live, trustworthy early-warning system. By combining advanced language understanding, adaptive learning, and a tamper-evident ledger, the framework spots threats more accurately, responds faster—reducing processing time per batch of reports by about half—and preserves a reliable history of what was seen and decided. For banks, hospitals, industrial sites, and internet-of-things environments, such a system could provide a shared, transparent backbone for cyber defense—one that keeps improving as new attacks emerge, instead of waiting for static rule sets to catch up.

Citation: Mishra, S., Alfahidah, R.A. & Alharbi, F. BERT-spaCy hybrid NLP and blockchain-enhanced adaptive CTI for IOC extraction and threat prediction. Sci Rep 16, 8147 (2026). https://doi.org/10.1038/s41598-025-34505-2

Keywords: cyber threat intelligence, malware detection, blockchain security, machine learning, network intrusion