As cyberattacks continue to rise in frequency and sophistication, extracting actionable Cyber Threat Intelligence (CTI) from diverse online sources has become critical for proactive threat detection and defense. However, accurately identifying complex entities in lengthy, heterogeneous threat reports remains challenging because of long-range dependencies and domain-specific terminology. To address this, we propose XLNet-CRF, a hybrid framework that combines permutation-based language modeling with structured prediction using Conditional Random Fields (CRF) to enhance Named Entity Recognition (NER) in cybersecurity contexts. XLNet-CRF addresses key challenges in CTI-NER by modeling bidirectional dependencies and capturing non-contiguous semantic patterns more effectively than traditional approaches. Comprehensive evaluations on two benchmark cybersecurity corpora validate the efficacy of our approach. On the CTI-Reports dataset, XLNet-CRF achieves a precision of 97.41% and an F1-score of 97.43%; on MalwareTextDB, it attains a precision of 85.33% and an F1-score of 88.65%. On both corpora it significantly surpasses strong BERT-based baselines in accuracy and robustness.
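To illustrate what structured prediction with a CRF adds on top of per-token encoder scores, the following is a minimal sketch of Viterbi decoding, not the authors' implementation. The tag set (`O`, `B-MAL`, `I-MAL`), the emission scores (standing in for XLNet outputs), and the transition scores are all hypothetical values chosen for illustration.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence for one sentence.

    emissions[t][j]   : encoder score for tag j at token t
    transitions[i][j] : CRF score for moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []                 # backptr[t-1][j] = best previous tag for j at t
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptrs.append(best_i)
        score = new_score
        backptr.append(ptrs)
    # Trace back from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptrs in reversed(backptr):
        best = ptrs[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical tag set and scores (assumptions, not taken from the paper).
TAGS = ["O", "B-MAL", "I-MAL"]
FORBIDDEN = -1e4  # large penalty: I-MAL may not directly follow O
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O
    [0.0, 0.0, 0.5],        # from B-MAL
    [0.0, 0.0, 0.5],        # from I-MAL
]
emissions = [
    [2.0, 0.1, 0.0],  # token 1: clearly outside any entity
    [0.2, 1.0, 1.2],  # token 2: encoder slightly prefers I-MAL over B-MAL
    [0.3, 0.0, 1.0],  # token 3: continuation of the entity
]

decoded = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
greedy = [TAGS[max(range(len(TAGS)), key=lambda j: row[j])]
          for row in emissions]
print("greedy :", greedy)
print("viterbi:", decoded)
```

In this toy case, picking the best tag per token independently yields an `I-MAL` tag directly after `O`, which is not a valid BIO sequence, whereas decoding over the whole sequence with transition scores recovers a well-formed entity span.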
Details
Semantics;
Word sense disambiguation;
Accuracy;
Deep learning;
Conditional random fields;
Ontology;
Modelling;
Recognition;
Terminology;
Neural networks;
Natural language processing;
Labeling;
Cybersecurity;
Adaptation;
Threat evaluation;
Malware;
Permutations;
Language modeling;
Intelligence gathering;
Threats;
Robustness;
Sophistication;
Efficacy;
Acknowledgment;
Bidirectionality;
Intelligence
1 School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
2 Shandong Key Laboratory of Industrial Network Security, Weihai 264209, China, Harbin Institute of Technology (Weihai) Qingdao Research Institute, Qingdao 266000, China
3 School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China, Weihai Cyberguard Technologies Co. Ltd., Weihai 264209, China