Abstract

Portable Document Format (PDF) files are widely used for information exchange but have become a frequent vector for cyberattacks. Traditional signature-based and heuristic methods often fail against obfuscation and polymorphic malware, highlighting the need for more adaptive detection strategies. This study addresses the problem of PDF malware detection by applying machine learning, focusing on ensemble methods. A Random Forest model was trained on the PDFMal-2022 dataset using both static features (file size, page count, text length, image and JavaScript markers) and engineered features (text-to-size ratio, images-per-page ratio, missing text flag, and enhanced JavaScript count). Stratified cross-validation demonstrated stable performance with a macro F1-score of approximately 0.992. Feature importance analysis further confirmed the dominance of JavaScript-related attributes. The contribution of this work is to demonstrate that a lightweight and interpretable Random Forest framework can deliver state-of-the-art detection while avoiding the computational demands of deep learning.
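
The engineered features named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so everything below is an assumption: the field names, the division-by-zero guards, and in particular the reading of "enhanced JavaScript count" as the sum of two marker counts are hypothetical and may differ from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class StaticFeatures:
    """Raw per-file measurements. All field names are hypothetical."""
    file_size_bytes: int
    page_count: int
    text_length: int
    image_count: int
    js_markers: int           # assumed: occurrences of a /JS marker
    javascript_markers: int   # assumed: occurrences of a /JavaScript marker


def engineer_features(f: StaticFeatures) -> dict:
    """Derive the four engineered features from the static ones."""
    # Guard against division by zero for empty or malformed files.
    size = max(f.file_size_bytes, 1)
    pages = max(f.page_count, 1)
    return {
        "text_to_size_ratio": f.text_length / size,
        "images_per_page": f.image_count / pages,
        # 1 when no text could be extracted, else 0.
        "missing_text_flag": int(f.text_length == 0),
        # Assumption: "enhanced" is taken to mean both marker counts combined.
        "enhanced_js_count": f.js_markers + f.javascript_markers,
    }


sample = StaticFeatures(
    file_size_bytes=2000, page_count=4, text_length=500,
    image_count=2, js_markers=1, javascript_markers=2,
)
features = engineer_features(sample)
```

Under these assumptions, each file yields a small numeric vector that could be combined with the static features and passed to a tree ensemble; one plausible setup would be scikit-learn's `RandomForestClassifier` evaluated with `StratifiedKFold`, although the abstract does not name the library used.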

© 2025. This work is published under https://creativecommons.org/licenses/by-sa/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.