Full text

Turn on search term navigation

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

The increasing prevalence of malicious Microsoft Office documents poses a significant threat to cybersecurity. Conventional methods of detecting these malicious documents often rely on prior knowledge of the document or the exploitation method employed, thus enabling the use of signature-based or rule-based approaches. Given the accelerated pace of change in the threat landscape, these methods are unable to adapt effectively to the evolving environment. Existing machine learning approaches are capable of identifying sophisticated features that enable the prediction of a file’s nature, achieving sufficient results on existing samples. However, they are seldom adequately prepared for the detection of new, advanced malware techniques. This paper proposes a novel approach to detecting malicious Microsoft Office documents by leveraging the power of large language models (LLMs). The method involves extracting textual content from Office documents and utilising advanced natural language processing techniques provided by LLMs to analyse the documents for potentially malicious indicators. As a supplementary tool to contemporary antivirus software, it is currently able to assist in the analysis of malicious Microsoft Office documents by identifying and summarising potentially malicious indicators with a foundation in evidence, which may prove to be more effective with advancing technology and soon to surpass tailored machine learning algorithms, even without the utilisation of signatures and detection rules. As such, it is not limited to Office Open XML documents, but can be applied to any maliciously exploitable file format. The extensive knowledge base and rapid analytical abilities of a large language model enable not only the assessment of extracted evidence but also the contextualisation and referencing of information to support the final decision. We demonstrate that Claude 3.5 Sonnet by Anthropic, provided with a substantial quantity of raw data, equivalent to several hundred pages, can identify individual malicious indicators within an average of five to nine seconds and generate a comprehensive static analysis report, with an average cost of USD 0.19 per request and an F1-score of 0.929.

Details

Title
Detection of Malicious Office Open Documents (OOXML) Using Large Language Models: A Static Analysis Approach
Author
Heß, Jonas  VIAFID ORCID Logo  ; Graffi Kalman  VIAFID ORCID Logo 
First page
32
Publication year
2025
Publication date
2025
Publisher
MDPI AG
ISSN
2624800X
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3223910832
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.