Abstract
Automated techniques for detecting software vulnerabilities are necessary for developing secure systems. While deep learning approaches have been designed for this purpose, many focus solely on source and binary versions of code, ignoring intermediate representations. Models exist that evaluate images created from code, yet they do not provide multiclass classification of vulnerabilities, which developers need in order to address specific insecurities. This research seeks to fill this gap by investigating deep learning approaches that classify images generated from tokenized source code. To accomplish this, we performed a model-based formulative analysis of several models, comparing their accuracy on a PHP-based dataset of web-based vulnerabilities to suggest an optimal model for vulnerability detection. The research resulted in a process for creating images from PHP tokens and a ConvNeXt convolutional neural network that operates on ‘extended’ grayscale images of tokenized PHP source code. Our model achieved macro F1 scores of 0.958 and 0.962 in binary and multiclass classification, respectively, outperforming existing models that operate on the tokenized code of the same dataset. These results suggest that image representations of tokenized code are a promising direction for future vulnerability detection.
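For readers unfamiliar with the metric, the macro F1 score reported above is the unweighted mean of the per-class F1 scores, so rare vulnerability classes count as much as common ones. The following is a minimal sketch of that computation, not the evaluation code used in this research; the function name `macro_f1` and the class labels in the usage note are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Illustrative sketch only; labels may be any hashable, sortable values.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Tally true positives, false positives, and false negatives per class.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(scores) / len(scores)
```

As a usage example with made-up labels, `macro_f1(["safe", "xss", "sqli", "xss", "safe", "sqli"], ["safe", "xss", "xss", "xss", "safe", "sqli"])` averages per-class F1 scores of 1.0, 0.8, and about 0.667, giving roughly 0.822.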