Static detection method for multi-level network source code vulnerabilities based on knowledge graph technology

Abstract

The current static detection method of network source code vulnerabilities mainly relies on the static analysis of binary code. However, due to the failure to fully simulate the actual operating environment of programs, some vulnerabilities that trigger only under specific conditions are difficult to be found by static detection tools. This limitation increases the difficulty of static analysis. Therefore, a static detection method for multilevel network source code vulnerability based on knowledge graph technology is proposed. Web crawler technology is selected to collect and preprocess vulnerability data to avoid interference from network environment and malicious programs, which leads to redundancy and abnormal vulnerability data. By introducing knowledge graph information and combining word embedding with knowledge embedding, named entities are automatically identified from the preprocessed vulnerability data set. While adopting the joint embedding technology, we can integrate the word embedding and knowledge embedding more effectively, introducing an attention mechanism to enhance the weight of key information. Improve the effectiveness of the named entity identification. The identified named entities are taken as the basic nodes in the knowledge graph to build the multi-level network source code vulnerability knowledge graph, calculate the vulnerability attack error and attack loss, and quantitatively evaluate the accuracy of detection and the potential harm of vulnerabilities. The experimental results show that the proposed method can accurately detect the identification of named entities and vulnerabilities, and has certain positive significance to ensure the network security.

Full text

Translate

Turn on search term navigation

Introduction

With the continuous increase in network size and complexity, network security is facing unprecedented challenges, and source code vulnerabilities are an important factor leading to network security issues [1]. On the one hand, the interactions and dependencies between different levels in a multi-level network are complex, and traditional vulnerability detection methods often struggle to comprehensively and accurately detect vulnerabilities hidden in source code. For example, in a system that includes interactions between the application, transport, and network layers, a single level of detection may overlook vulnerabilities that arise from cross layer interactions [2, 3]. On the other hand, new programming techniques and frameworks continue to emerge, making the structure and logic of source code more complex and diverse and increasing the possibility of vulnerabilities. Attackers are increasingly adept at exploiting these vulnerabilities for malicious attacks, such as stealing sensitive information and disrupting system operations [4]. Therefore, research on static detection methods for multi-level network source code vulnerabilities is of great significance. It helps improve the level of network security, protect important information, and ensure the stable operation of systems, and reduce various risks caused by the exploitation of source code vulnerabilities.

To improve the security and stability of software systems, researchers are committed to developing static detection methods for network source code vulnerabilities to more efficiently and accurately identify and repair potential security vulnerabilities and thus build a more secure and reliable digital environment. For example, reference [5] proposes a source code vulnerability detection method based on a feature dependency graph, which extracts candidate vulnerability statements from function slices and generates a feature dependency graph by analyzing the control and data dependency chains of the candidate vulnerability statements. Using a word vector model to generate initial node representation vectors for feature dependency graphs, a vulnerability detection neural network oriented towards feature dependency graphs is constructed. The graph learning network learns the heterogeneous neighbor node information of the feature dependency graph, and the detection network extracts the global features for vulnerability detection. The experimental results show that this method can fully explore the local and global features in the feature dependency graph. However, it has a strong dependence on the data. Deviations or incompleteness in the training data may cause the neural network to learn incorrect patterns, thereby affecting the accuracy of vulnerability detection. Reference [6] proposed an automated vulnerability detection method based on the Relationship Graph Convolutional Network (RGCN), which converts program source code into CPG containing syntax and semantic feature information, uses RGCN to learn the representation of the graph structure, and trains a neural network model to predict vulnerabilities in program source code. The experimental results show that this method can effectively integrate multiple key pieces of information from the source code, making the subsequent vulnerability detection results more comprehensive. However, the RGCN model is relatively complex, and the training process requires a large amount of computing resources and a long time, making it less practical. Reference [7] proposes a vulnerability detection method for Python source code, which uses named entity recognition to classify different types of vulnerabilities and fine tunes downstream tasks of token classification for different common weakness enumeration standards, thus achieving vulnerability detection. The experimental results show that this method has the advantage of strong targeting. However, owing to the complexity and diversity of the code, there is a possibility of false positives and false negatives. Reference [8] introduces a detection framework consisting of dataset generation, tools, and machine learning methods for detecting source code vulnerabilities. Machine learning methods have powerful learning and adaptive capabilities, and can mine hidden patterns and patterns from large amounts of data. However, the generated data may not fully represent all vulnerability situations in reality, and there is a risk of data bias, which may lead to inaccurate detection results.

Given the numerous issues with existing methods, a multi-level network source code vulnerability static detection method based on knowledge graph technology is proposed to provide more reliable protection for multi-level network security, reduce risks such as data leakage and system paralysis caused by the exploitation of source code vulnerabilities, and adapt to increasingly complex network security environment requirements.

Design of static detection method for multi level network source code vulnerabilities

Vulnerability data collection and preprocessing

Vulnerability data collection and preprocessing uses intelligent crawler technology to extract the required vulnerability information from vulnerability data from different sources and integrate it into structured vulnerability knowledge, which serves as the data support for the vulnerability knowledge graph.

In general, vulnerability data mainly exist in unstructured text forms, such as NVD, CVE, and other vulnerability databases. The types of data stored in each vulnerability database have significant differences, making the presentation and storage location of vulnerability data relatively random, which causes great difficulties in vulnerability data collection. Based on the characteristics of the vulnerability data mentioned above, this study selects web crawler technology to collect the vulnerability data, as shown in Fig. 1.

[See PDF for image]

Fig. 1

Flow Chart of vulnerability data collection

Collect vulnerability data using the process shown in Fig. 1 and integrate it into a collection form, denoted as $X = \{x_{1}, x_{2}, . . ., x_{n}\}$ , where $n$ represents the total number of vulnerability data.

Owing to the susceptibility of web crawlers to interference from network environments, malicious programs, and other factors during vulnerability data collection, there may be redundancy, hierarchical logic confusion, and anomalies in vulnerability data, which are not conducive to the construction of a multi-level network source code vulnerability knowledge graph. Therefore, before constructing a vulnerability knowledge graph, preprocessing of vulnerability data is necessary. Calculate the similarity between any two data points in the vulnerability dataset, expressed as:

S (x_{i}, x_{j}) = \frac{x_{i} \cap x_{j}}{x_{i} \cup x_{j}} \times η

In the formula, $S (x_{i}, x_{j})$ represents the similarity between vulnerability data $x_{i}$ and $x_{j}$ ; $x_{i} \cap x_{j}$ represents the intersection of vulnerability data $x_{i}$ and $x_{j}$ ; $x_{i} \cup x_{j}$ represents the union of vulnerability data $x_{i}$ and $x_{j}$ ; $η$ an auxiliary parameter representing the similarity, determined by calculating the ratio of the size of the intersection of the two sets to the union size.

Based on the calculation result of formula (1), we determine whether the vulnerability data are redundant. The determination rule is: when $S (x_{i}, x_{j}) = 1$ , it indicates that $x_{i}$ and $x_{j}$ are redundant data and one of them needs to be deleted; When $S (x_{i}, x_{j}) \neq 1$ , it indicates that $x_{i}$ and $x_{j}$ are normal data, and two data are retained.

The detection of vulnerability anomaly data is also a key part of pre-processing. The calculation formula for the vulnerability anomaly data detection factor is:

A_{e} = \frac{x_{i} - \bar{x}}{σ}

In the formula, $A_{e}$ represents the vulnerability anomaly data detection factor; $\bar{x}$ represents the average value of vulnerability data; $σ$ represents the standard deviation of vulnerability data. Based on the calculation result of formula (2), we determine whether the vulnerability data are abnormal. The determination rule is: when $|A_{e}| > 1$ , it indicates that $x_{i}$ is abnormal data; When $|A_{e}| \leq 1$ , it indicates that $x_{i}$ is normal data.

Delete the redundant and abnormal data detected, and reorganize the remaining data to obtain the preprocessed vulnerability data set, denoted as $X^{'} = \{x_{1}', x_{2}', . . ., x_{m}'\}$ , where $m$ represents the total number of preprocessed vulnerability data.

The above process completes the collection and preprocessing of vulnerability data, sorting out the raw and chaotic vulnerability data, removing erroneous data, etc., to achieve a certain quality standard, and obtains the final set of vulnerability data, laying a solid foundation for the subsequent entity recognition and construction of the vulnerability knowledge graph.

Named entity recognition

Next, named entities will be automatically identified from the preprocessed vulnerability dataset $X^{'} = \{x_{1}', x_{2}', . . ., x_{m}'\}$ . Named entity recognition can classify and label key information in vulnerability data [9, 10], allowing for more targeted analysis of different types of entities and their relationships in subsequent vulnerability detection processes, improving the accuracy and efficiency of vulnerability detection, and better focusing on key parts related to vulnerabilities.

Traditional BERT model is well at mining semantic information in data, but it considers less about combining knowledge graph to enhance capabilities when handling named entity recognition tasks. As a resource rich in structured information, knowledge graph is crucial to improving the language comprehension ability of models. By introducing knowledge graph information and combining it with word embeddings, we implement joint embeddings, a strategy that brings unique advantages. The unique advantage of combining word embedding and knowledge embedding lies in its ability to synthesize the lexical information on the surface of the text with deep semantic relationships to generate a more comprehensive and accurate semantic representation. Word embedding captures the co-occurrence relationship and context between words, while knowledge embedding reveals the internal connections and attribute characteristics among entities. This integration not only enriches the semantic information database of the model, but also enhances the ability to understand complex semantic relationships. The performance improvement of named entity recognition can be justified to prove that this integration is necessary. By introducing knowledge graph information and combining with word embedding, the recognition performance of named entities is significantly improved. This integration strategy is particularly important when dealing with multi-level, high-complexity tasks, such as network source code vulnerability detection. It can accurately identify the key entities in the source code, providing a more solid foundation for vulnerability detection, so as to effectively improve the accuracy and efficiency of vulnerability detection. Therefore, combining word embedding and knowledge embedding is a necessary and effective strategy to improve the performance of natural language processing models and enhance the semantic comprehension ability.

Set the term sequence marker as $\{s_{1}, s_{2}, . . ., s_{N}\}$ , where $N$ is the length of the term sequence, and set the entity sequence marker aligned with the term sequence as $\{d_{1}, d_{2}, . . ., d_{M}\}$ , where $M$ represents the length of the entity sequence. In addition, the entire vocabulary containing all tags is denoted as $S$ , and the entity sequence table containing all entities is denoted as $D$ . In named entity recognition, on the one hand, the T-Encoder is used to capture basic lexical and syntactic information from text [11], whereas the K-Encoder is used to integrate the knowledge graph into the underlying text information, ultimately representing the heterogeneous information of vocabulary and entities as a unified feature space.

In the T encoder, the three information of word embedding, sentence embedding and position embedding are first combined into the input of the T encoder, which together constitute a comprehensive representation of the input text. The function of T encoder is to use these comprehensive information to dig deep into the correlation between words and the semantic relations between sentences through its internal mechanism. Subsequently, formula (3) is used to calculate vocabulary and semantic features that accurately capture key information and contextual context in the text. In this way, the T encoder is able to generate a richer and more accurate text representation, providing strong support for subsequent natural language processing tasks.

\{s_{1}, s_{2}, . . ., s_{n}\} = T - E n c o d e r \{s_{1}, s_{2}, . . ., s_{n}\}

In the K-encoder, the corresponding entities in the text are first identified and extracted, in order to obtain the key information units in the text. Then, the knowledge graph embedding method is used to transform these entities into corresponding vector representations, which can accurately capture the semantic correlation and attribute characteristics of the entities in the knowledge graph. The main function of K-encoder is that it can integrate the entity information in the text into the model in a structured form, thus significantly enhancing the model's ability to interpret the text semantics. In this way, the K-encoder provides richer and accurate semantic information for subsequent natural language processing tasks, which improves the performance of tasks, as shown in Fig. 2.

[See PDF for image]

Fig. 2

Working principle of K-Encoder

Then use $\{s_{1}, s_{2}, . . ., s_{n}\}$ and $\{d_{1}, d_{2}, . . ., d_{m}\}$ as inputs to the K-Encoder, i.e.:

\{s_{1}, s_{2}, . . ., s_{n}\}, \{d_{1}, d_{2}, . . ., d_{m}\} = K - E n c o d e r \{s_{1}, s_{2}, . . ., s_{n}\}, \{d_{1}, d_{2}, . . ., d_{m}\}

The output results $\{s_{1}^{o}, s_{2}^{o}, . . ., s_{n}^{o}\}$ and $\{d_{1}^{o}, d_{2}^{o}, . . ., d_{n}^{o}\}$ will serve as features for the relevant tasks. Specifically, the K-Encoder consists of stacked aggregators designed to encode terms and entities and fuse their heterogeneous features. In the relevant aggregator $i$ , word embedding and entity embedding are represented through a multihead self-attention mechanism, namely:

\{s_{1}^{o}, s_{2}^{o}, \dots, s_{n}^{o}\} = M H - A H T \{s_{1}^{(i - 1)}, s_{2}^{(i - 1)}, \dots, s_{n}^{(i - 1)}\}

\{d_{1}^{o}, d_{2}^{o}, \dots, d_{n}^{o}\} = M H - A H T \{d_{1}^{(i - 1)}, d_{2}^{(i - 1)}, \dots, d_{n}^{(i - 1)}\}

The aggregator achieves the fusion of word and entity sequences through an information fusion layer, and calculates the output embeddings for each word and entity. For term $s_{j}$ and its aligned entity $d_{k} = f (s_{j})$ , the information fusion process is as follows:

h_{j} = μ (W_{t}^{i} w_{j}^{i} + W_{d}^{i} w_{k}^{i} + b_{i}^{o})

s_{j}^{i} = μ (W_{t}^{i} h_{j} + b_{t}^{o})

d_{j}^{i} = μ (W_{e}^{i} h_{j} + b_{e}^{o})

In the formula, $h_{j}$ represents the internal hidden state of the integrated word and entity information; $μ$ represents the GELU activation function. In this way, the knowledge information of entities is integrated into the enhanced representation of text semantics. For words without corresponding entities, the information fusion layer calculates the output embedding without integration, that is:

h_{j} = σ (W_{t}^{i} w_{j}^{i} + b_{i}^{o})

s_{j}^{i} = σ (W_{t}^{i} h_{j} + b_{t}^{o})

In addition, to simplify the named entity recognition process, the $i$ -layer aggregator operation is represented by Eq. (12):

\{s_{1}^{i}, s_{2}^{i}, . . ., s_{n}^{i}\}, \{d_{1}^{i}, d_{2}^{i}, . . ., d_{n}^{i}\} = A g g \{s_{1}^{i - 1}, s_{2}^{i - 1}, . . ., s_{n}^{i - 1}\}, \{d_{1}^{i - 1}, d_{2}^{i - 1}, . . ., d_{n}^{i - 1}\}

The output embeddings of terms and entities calculated by the top-level aggregator are used as the final output embeddings of K-Encoder to achieve multi-level network source code vulnerability named entity recognition.

Because of the ability of the attention mechanism to dynamically focus on the crucial parts of the source code text for named entity recognition, and the joint embedding technique ensures that heterogeneous information from the text and knowledge graph can complement and enhance each other in a unified feature space, it is beneficial for improving the model's recognition accuracy and generalization ability for complex named entities. Therefore, by introducing an attention mechanism to enhance the weight of key information and adopting joint embedding technology to more effectively integrate word embedding and knowledge embedding, the expression is:

E_{f} = α \cdot T (W) + β \cdot K (G, W) + γ \cdot A (T (W), K (G, W))

In the formula, $T (W)$ represents the text embedding processed by T-Encoder; $K (G, W)$ represents the knowledge graph embedding processed by K-Encoder; The $A (\cdot)$ function is used to calculate attention weights; $α, β, γ$ is an adjustable parameter used to ensure the effectiveness of information fusion.

Based on the above analysis, we can further explain the training of the T-encoder and K-encoder networks. Take the example of a dataset containing multi-level network source code, which contains a large number of source code samples containing potential vulnerabilities. In order to train the model, the source code text is first input into the BERT model and its variants T-Encoder, and T-Encoder obtains basic vocabulary and grammar information from the source code text to form a preliminary word embedding representation. This step provides the underlying semantic information of the source code for the model. Subsequently, a knowledge graph is introduced, which contains the relationships and properties between the entities in the source code. Using the K-encoder, the structured information in the knowledge graph is integrated into the underlying text information, which enhances the model's semantic understanding of the named entities in the source code. This step provides deeper semantic information for the model and facilitates the identification of complex named entities. Finally, we jointly embed the output of the T-and K-encoders and introduce attention mechanisms that dynamically adjust the weights of different information sources to form a unified, semantically rich representation of the feature space. This feature-space representation was used for the final named entity recognition task.

To verify the role of T-and K-encoder, ablation studies were performed. The models were trained using only T-encoder, K-encoder, and using T-encoder and K-encoder separately, and tested on the same dataset, and the test results are shown in Table 1.

Table 1. Contrast the results table

model	Vocabulary and grammar information processing	Named entity deep semantic relationship processing	Comprehensive competency assessment
only T-encoder	weak	Lack of ability to deal with complex semantic relationships	range
Only K-encoder	Relatively weak	A little less	range
using T-encoder and K-encoder separately	More outstanding	Ability to capture deep semantic relationships	outstanding

The experimental results show that models using only T-encoder or K-encoder still have limitations in performance. T-encoder performs well in processing vocabulary and grammar information, but slightly underthe deep semantic relationships of named entities; K-encoder captures the deep semantic relationships of named entities but is relatively weak in handling vocabulary and grammar information. When combining the T-encoder and the K-encoder, the model performance on the named entity recognition task was significantly improved. This indicates that T-coder and K-coder can complement each other and jointly improve the semantic understanding of the model.

In summary, it can be concluded that in multi-level network source code vulnerability named entity recognition, first, the BERT model and its variants (T-Encoder) are used to capture basic lexical and syntactic information from the source code text, forming preliminary word embedding representations; Next, by introducing a knowledge graph and utilizing K-Encoder to integrate the structured information in the graph into the underlying text information, the model's semantic understanding of named entities in the source code is enhanced; Finally, the outputs of T-Encoder and K-Encoder are jointly embedded, and attention mechanisms may be introduced to dynamically adjust the weights of different information sources to form a unified and semantically rich feature space representation for the final named entity recognition task. This process not only improves the recognition accuracy of the model for complex named entities in the source code but also provides a more solid information foundation for subsequent vulnerability detection.

Construction of vulnerability knowledge graph

The named entities identified above are used as the basic nodes in the knowledge graph to construct a multi-level network source code vulnerability knowledge graph. The traditional vulnerability database contains redundant and missing vulnerability information, and some vulnerability information is not clearly related, which cannot be well used for vulnerability analysis. Building a vulnerability knowledge graph based on vulnerability data can aggregate multi-source vulnerability information and clearly display the correlation between different types of vulnerability knowledge [12, 13]. Based on the vulnerability data collected and processed in Sect. 2.1, this study constructs a structured vulnerability knowledge graph and utilizes technologies such as multisource vulnerability knowledge fusion and automatic vulnerability knowledge updates to maintain the information accumulation and updating of the knowledge graph data.

In this study, the triplet is used as the representation of the vulnerability knowledge graph, defined as:

G = (Y, U, P)

In the formula, $Y$ represents the set of entities in the vulnerability knowledge graph, including vulnerabilities, software, vulnerability patches, and software dependencies; $U$ represents the set of relationships in the vulnerability knowledge graph, which is used to characterize the relationships between various entities, such as the relationships between vulnerability entities and software entities; $P$ represents the triplet set in the vulnerability knowledge graph, which includes basic forms such as (entity 1, attribute, attribute value) and (entity 1, relationship, entity 2).

A vulnerability knowledge graph mainly consists of two structures: vulnerability and software dependency entities. The two entities contain multiple sub-entity structures, and different entities are connected through different association relationships. The entity structure in the knowledge graph is defined as follows:

V_{G} = \{H_{G}, D_{E}\}

In the formula, $V_{G}$ represents the vulnerability knowledge graph constructed in this article; $H_{G}$ and $D_{E}$ respectively represent two major structures: vulnerability entity and software dependency entity. The vulnerability entity includes sub-entity structures such as vulnerability basic information, vulnerability type CWE, vulnerability level CVSS, vulnerability exploitation, affected software, affected software versions, and vulnerability patches. Software dependency entities include entity structures, such as software projects, project versions, project collections, project dependencies, and dependency versions.

The entity structure relationship of the vulnerability knowledge graph is shown in Fig. 3, where different entities are connected through different relationships and each entity contains different types of attributes.

[See PDF for image]

Fig. 3

Multi level network source code vulnerability knowledge graph

The constructed vulnerability knowledge graph can form a complex network containing vulnerability entities, software dependency entities, and their associated relationships by integrating multisource vulnerability data. This graph not only enriches the representation of vulnerability knowledge but also enhances its practicality. By utilizing entity relationships in the graph, potential vulnerabilities in source code can be analyzed more effectively, vulnerability types can be identified, vulnerability propagation paths can be traced, and the severity of vulnerabilities can be evaluated [14, 15]. At the same time, combined with the reasoning ability of graphs, possible attack patterns can be predicted to assist in detecting unknown vulnerabilities in the system, thereby achieving comprehensive and accurate detection of vulnerabilities in multi-level network source code.

Calculation of vulnerability attack error and attack loss

After constructing the vulnerability knowledge graph, to quantitatively evaluate the accuracy of detection and the potential harm of vulnerabilities, further calculations of vulnerability attack errors and attack losses are carried out. The calculation of the attack error and loss is the core step in selecting the attack paths. Attack error describes the reachability of all attack behaviors successfully implemented by the attacker in a certain attack path, whereas attack loss describes the input–output ratio of each attack path. For a single atomic attack, the attack error is set as the unique value that evaluates the accuracy factors of all interference attacks. Assuming that there are $a$ interference factors, each with a weight of 1, the attack error of this atomic attack is:

Q_{1} = \sum_{j = 1}^{a} c_{j} z_{j}

In the equation, $c_{j}$ and $z_{j}$ represent the actual location of the target node being attacked and the location of the target node being attacked by the attacker, respectively. Owing to differences in attack purposes and methods among different attackers, multiple attack results may occur. Therefore, it can be concluded that the same attack entity can exhibit multiple attack loss attributes. Assuming that there are $t$ attack loss attributes, the overall attack loss of this atomic attack is:

O_{1} = \sum_{j = 1}^{T} b_{j}

In the formula, $b_{j}$ is the sum of $j$ attack loss values. For an attack path with randomly existing $l$ nodes, assuming that the node numbers from the initial node to the target node are $1 \sim l$ , the overall attack error of this attack path is:

Q_{l} = exp (\sum_{j = 1}^{a} c_{j} z_{j})

The total attack loss of all attack paths is:

O_{l} = \prod_{j = 1}^{l} b_{j}

The impact of potential attacks on power grid security vulnerabilities can be thoroughly and comprehensively evaluated by accurately calculating the error and loss of vulnerability attacks. This evaluation process not only considers the direct consequences that may be caused by attack behavior, such as system crashes and data leaks, but also covers the accuracy of the detection system performance and potential vulnerability identification revealed by attack errors. Through quantitative analysis, the actual risks of exploiting vulnerabilities can be more intuitively understood, providing strong data support for the formulation and optimization of power grid security protection strategies and ensuring the stable operation and power supply safety of the power grid system.

Experimental testing and result analysis

To verify the effectiveness of the static detection method for multi-level network source code vulnerabilities based on knowledge graph technology, experimental research was conducted. In the experiment, Linux system was selected as the operating system to provide a stable and efficient running environment; Using Python as the primary programming language to support knowledge graph construction, data processing, and algorithm implementation; Selecting IDE as a development tool to provide convenience for code writing, debugging, and version control; Select TensorFlow deep learning framework for training and optimizing knowledge graph related models; Select Py2neo to interact with the graph database.

In the software environment above, a vulnerability data automated update engine is used to achieve regular updates and expansion of the vulnerability knowledge graph. The sizes of the vulnerability data are shown in Table 2.

Table 2. Vulnerability entity data scale

Physical nodes	Number
Loophole	135,694
Types of vulnerabilities	213
Vulnerability level	305
Vulnerability patch	10,456
Exploiting vulnerabilities	19,852

The vulnerability entity contains 135,694 vulnerability nodes, covering a variety of vulnerability data published on the Internet. These vulnerability data include 213 vulnerability types, 10,456 vulnerability patches and 19,852 vulnerability utilization data, which can be used for vulnerability analysis and research.

Verification of named entity recognition effectiveness

Test the identification effect of the knowledge graph method named entity multilevel network source code vulnerability, a 5 times cross validation for designing the following comparison experiment analysis: this experiment designed a set of comparison experiment, comparison method with traditional characteristic dependent graph method, relationship graph convolution network method and machine learning method, the three methods compared with this method, and the mean as a performance index, verify the performance of each method, the results as shown in Table 3.

Table 3. Named entity recognition results

Method	Total number of entities	Identify the number	The correct number	Accuracy/%	Recall	F value
Traditional feature-dependent graph methods	523	456	400	87.76	76.44	81.76
Relational graph convolution network approach	523	491	430	87.58	82.22	84.83
Machine learning method	523	502	445	88.64	84.97	86.77
The method of this paper	523	518	460	88.80	87.95	88.37

The experimental results showed that the feature-dependent graph method identified 456 entities, of which 400 were correct, with an accuracy rate of 87.76%, recall rate of 76.44%, and F-value of 81.76%, indicating relatively average performance. The relational graph convolutional network method identified more entities (491) with 430 correct ones, slightly reducing the accuracy to 87.58%. However, the recall rate increased to 82.22%, and the F-value increased to 84.83%, demonstrating the effectiveness of using convolution operations on graph-structured data. The machine learning method further increased the number of recognitions to 502, with a correct number of 445 and an accuracy rate of 88.64%. The recall rate and F-value also reached 84.97% and 86.77%, respectively, indicating the advantages of machine learning methods in complex feature extraction. However, the knowledge graph method proposed in this study achieved the best performance for all the above indicators, identifying 460 correct entities out of 518, with a high accuracy rate of 88.80%. The recall rate and F-value also reached 87.95% and 88.37%, respectively. This fully verifies the superiority and effectiveness of the knowledge graph method in the task of named entity recognition in the field of multi-layer network source code vulnerabilities. This result not only demonstrates the unique advantages of knowledge graph methods in entity relationship modeling and feature extraction but also provides strong support for subsequent research and applications.

Verification of vulnerability detection effectiveness

The vulnerability entity dataset constructed above was applied to vulnerability detection, and the performances of the feature dependency graph, relationship graph convolutional network, and knowledge graph were compared and analyzed with the false positive rate as the experimental indicator to verify the effectiveness of the proposed method for detecting different types of vulnerabilities. The results are shown in Fig. 4.

[See PDF for image]

Fig. 4

Test results of false alarm rates using different methods

From Fig. 4, it can be seen that in vulnerability detection, the false positive rate of detection results shows significant fluctuations, which may be due to the fact that different types of vulnerabilities have different characteristics, triggering conditions, and impact ranges, increasing the uncertainty of false negative rates. Compared with feature-dependent graph and relationship graph convolutional networks, the knowledge graph method has a lower dropout rate, which is less than 6%, and has higher stability, indicating that this method can more comprehensively identify vulnerabilities in the source code and reduce potential security risks that have not been detected. However, feature-dependent graph and relational graph convolutional network methods have significant limitations and may miss many important vulnerabilities, thereby increasing the security risk of the system. Therefore, when conducting static detection of multi-level network source-code vulnerabilities, the proposed knowledge graph detection method should be chosen to ensure the security and stability of the source code.

To validate the effectiveness of the proposed knowledge graph method further, experiments were conducted on a real vulnerability firmware function dataset. The vulnerability firmware mainly comes from CWE's well-known vulnerability public website, which retrieves specific vulnerability firmware and performs network crawling on it. Examples include cross site scripting (XSS) vulnerability, input validation improper vulnerability (CWE-20), OS command injection vulnerability (CWE-78), and out of bounds write vulnerability (CWE-787). These vulnerabilities affect this firmware, as well as other versions and even a series of firmware. Therefore, crawling firmware files are affected by these vulnerabilities in the experimental analysis. Under these conditions, the detection performance of the feature dependency maps, relationship graph convolutional networks, and knowledge graphs was validated, and the results are shown in Fig. 5.

[See PDF for image]

Fig. 5

Comparison results of confusion matrices for different methods

The experimental results show that all three vulnerability detection methods exhibit high detection accuracy, with the proposed knowledge graph method performing particularly well, with the highest accuracy and smallest misclassification rate, indicating that it may have better performance in vulnerability detection. This means that the method can more accurately identify potential vulnerabilities in the code while reducing errors in mistaking non-vulnerabilities for vulnerabilities, thereby effectively improving the efficiency and reliability of vulnerability detection and providing stronger guarantees for code security. This result not only further proves the feasibility and reliability of the proposed method in practical applications but also provides new ideas and methods for vulnerability mining, firmware security analysis, and other fields.

A false alarm refers to the detection that the displayed results are not true vulnerabilities, and the false alarm rate represents the probability of such false alarm situations. Use $D$ to represent the total number of detected vulnerabilities in the firmware web, $M$ to represent the number of non vulnerabilities detected, and $P$ to represent the false positive rate of vulnerability detection:

P = (M / D) \times 100 %

The false positive rates of the feature dependency graph, relationship graph convolutional network, and proposed knowledge graph method are compared. Firmware is randomly selected from the well-known vulnerability public website of CWE and calculates the false positive rates for three types of vulnerabilities: XSS, command execution, and hard-coded passwords. The results are shown in Table 4.

Table 4. Comparison results of false alarm rates of different methods

Method	XSS	Command execution	Hard coded password	Misreported number	P/%
Feature dependency graph	6	1	12	5	26.3
Relationship graph convolutional network	12	2	0	3	21.4
Machine learning	6	1	9	1	6.2

From the results in Table 4, it can be observed that the proposed knowledge graph method has a lower false alarm rate than the feature-dependent graph and relationship graph convolutional network methods. A low false-positive rate means that this method can more accurately filter out non-vulnerability situations when identifying code vulnerabilities, reducing unnecessary warnings and interference, thereby improving the accuracy and credibility of vulnerability detection. This is crucial for developers and security analysts because they can rely more on the results provided by this method to locate and fix real vulnerabilities in the code more efficiently.

Conclusion

To identify and quantify potential security risks in advance and provide strong support for software security development and maintenance, a multi-level network source code vulnerability static detection method based on knowledge graph technology is proposed. The main achievements of this study are as follows:

We use web crawler technology to automatically collect vulnerability data and ensure the accuracy and reliability of the data through preprocessing steps, such as redundancy removal and anomaly detection, avoiding interference from external factors, such as network environment and malicious programs.
Introduce a joint embedding method for word embedding and knowledge embedding to automatically identify named entities from preprocessed vulnerability datasets. The accuracy and efficiency of NER have been improved by strengthening the weight of key information through the attention mechanism.
A multi-level network source code vulnerability knowledge graph was constructed using the identified named entities as basic nodes in the knowledge graph. This knowledge graph not only reflects the dependency relationships between code files and modules but also contains key information related to vulnerabilities, providing strong support for subsequent vulnerability detection.
The detection accuracy and potential harm of the proposed method are quantitatively evaluated by calculating the error and loss of vulnerability attacks.

The experimental results show that the proposed knowledge graph method has a high accuracy in named entity recognition and low false and missed detection rates in vulnerability detection, indicating that the detection results of this method are more reliable and can effectively locate and repair real vulnerabilities in the code.

Author contribution

Peng Xiao—Conceptualization, Resource, Writing Lina Zhang—Methodology, Writing Ying Yan—Supervision, Resource Zhenhong Zhang- Methodology, Supervision.

Funding

The study was supported by Yunnan Power Grid Co., LTD. Technology project “ smart grid equipment firmware vulnerability detection technology”(No: YNKJXM20220088).

Declarations

Ethics approval and Consent to Participate

Not applicable.

Consent to publication

Not applicable.

Competing interest

The authors declare no competing interests.

Data availability

The raw data can be obtained on request from the corresponding author.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Qasem A, Shirani P, Debbabi M, Wang L, Lebel B, Agba BL. Automatic vulnerability detection in embedded devices and firmware: survey and layered taxonomies. ACM Comput Surveys. 2022;54(2):25.1–42.

2. Ma Q, Wu Z, Wang Y, Wang X. Approach of web application access control vulnerability detection based on state deviation analysis. Comput Sci. 2023;50(02):346–52.

3. Senanayake J, Kalutarage H, Al-Kadri APL. Android source code vulnerability detection: a systematic literature review. ACM Comput Surv. 2023;55(9):1.1–37.

4. Jain VK, Tripathi M. An integrated deep learning model for Ethereum smart contract vulnerability detection. Int J Inf Secur. 2024;23(1):557–75.

5. Yang H, Yang H, Zhang L, Cheng X. Feature dependence graph based source code loophole detection method. J Commun. 2023;44(01):103–117.

6. Wen M, Wang R, Jiang S. Source code vulnerability detection based on relational graph convolution network. J Comput Appl. 2022;42(06):1814–21.

7. Ehrenberg M, Sarkani S, Mazzuchi TA. Python source code vulnerability detection with named entity recognition. Comput Secur. 2024;140:103802.1–103802.15.

8. Bhandari GP, Assres G, Gavric N, Shalaginov A, Grnli TM. IoTvulCode: AI-enabled vulnerability detection in software products designed for IoT applications. Int J Inf Secur. 2024;23(4):2677–90.

9. He J, Cai R, Yin X, Lu X, Liu S. Detection of web command injection vulnerability for Cisco IOS-XE. Comput Sci. 2023;50(04):343–50.

10. Toprak A, Turan M. Enhanced named entity recognition algorithm for financial document verification. J Supercomput. 2023;79(17):19431–51.

11. Lu Q, Yuan L. Software named entity recognition simulation based on combined neural network. Comput Simul. 2023;40(01):489–492+509.

12. Ahin CB. Semantic-based vulnerability detection by functional connectivity of gated graph sequence neural networks. Soft Comput. 2023;27(9):5703–19.

13. Pise RG, Patil S. Pioneering automated vulnerability detection for smart contracts in blockchain using KEVM: Guardian ADRGAN. Int J Inf Secur. 2024;23(3):1805–19.

14. Porkodi S, Kesavaraja D. Smart contract: a survey towards extortionate vulnerability detection and security enhancement. Wirel Netw. 2024;30(3):1285–304.

15. Zhao M, Li D. A method of software source code vulnerability detection based on BGRU. Mod Electron Techn. 2022;45(18):57–62.

Word count: 5462

Show less

Static detection method for multi-level network source code vulnerabilities based on knowledge graph technology

Content area

Abstract

Full text