Full text

Turn on search term navigation

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

As the software industry rapidly advances, the scale and logical complexity of software have significantly increased, and software has been widely applied and promoted in various fields [1]. While software brings convenience to people’s lives and work, the accompanying software vulnerabilities also pose unprecedented challenges to information security [2]. Hackers may exploit software vulnerabilities to engage in illegal activities such as obtaining advanced permission and accessing unauthorized data, leading to the leakage of personal privacy, trade secrets, and even sensitive national information. Traditional software testing typically focuses on functionality and robustness, with little consideration given to potential vulnerability issues. Consequently, the task of vulnerability discovery is a focus of growing concern [3].

Currently, fuzz testing is an effective method for discovering software vulnerabilities. However, the variation in most traditional fuzz testing methods is random, which is reflected in the selection of mutation positions [4]. Random mutation can change the original file format of test cases, resulting in many invalid test cases that are rejected by software due to syntax or structural errors in the early stages of testing. In addition, most fuzz testing methods improve code coverage, believing that code coverage has a strong correlation with bug coverage, and as long as more code paths can be discovered, more vulnerabilities can be found. However, the process of triggering vulnerabilities is complex, and executing fuzzer tests on paths containing errors may not necessarily trigger vulnerability. Therefore, fuzz testing methods based on code coverage require considerable time to cover many invalid code paths. In addition, even if executed to a code path containing vulnerabilities, vulnerabilities may not be triggered, leading to issues such as wasted computing resources and low efficiency [5]. Moreover, most mutation-based fuzz testing methods use fixed mutators during testing [6]. The fuzz testing method predesigns a mutator considering the attributes of the target program, and the mutator uses a fixed mutation strategy during testing, resulting in the generated test cases being unable to trigger more vulnerabilities [7].

In response to the above issues, this paper proposes a directed fuzz testing method based on a feedback mechanism mutator, DocFuzz. The aims of this work are as follows:

• To solve the problems of random selection of mutation positions and destruction of the file structure in fuzz testing, which leads to the rejection of test cases in the early stage of fuzz testing, a directed fuzz testing preprocessing method is designed, which generates an effective byte set as the mutation position in the test cases by inserting stakes and tracking stains on the target program. Compared with traditional fuzz testing methods without preprocessing, the syntax and structure validation pass rates of test cases are significantly improved.

• To solve the problem of existing fuzz testing methods consuming a large amount of resources on invalid code paths and insufficient exploration of critical paths, this paper introduces vulnerability coverage as a metric and proposes a reinforcement learning mutation method based on an improved UCB algorithm. The effective byte set is used as the mutation location, and a reinforcement learning algorithm is used to find mutation methods on the critical path that can trigger vulnerabilities to improve vulnerability coverage. The experimental results indicate that the method proposed in this paper has greater efficiency in discovering vulnerabilities.

• To further improve the effectiveness of the mutation method in this paper, a feedback machine is designed for the reinforcement learning mutator. After the reinforcement learning mutator converges and stops testing, the mutator is restarted and dynamically adjusted through a feedback mechanism, allowing the fuzz testing to continue in the direction of triggering vulnerabilities. The effectiveness of the feedback mechanism has been demonstrated through experiments.

• A vulnerability testing system prototype DocFuzz was designed and implemented. The experimental results show that compared with the baseline, the method proposed in this paper has stronger vulnerability detection ability and can trigger vulnerabilities faster. Meanwhile, the trained reinforcement learning model is portable and can quickly mutate and generate test cases that trigger vulnerabilities on different types of programs.

2. Related Work

At present, mainstream fuzz testing methods mainly include mutation-based fuzz testing methods and generative fuzz testing methods [8]. The mutation-based fuzz testing method randomly modifies existing valid input data to generate new test cases, aiming to detect errors caused by abnormal inputs in software. The generative fuzz testing method constructs input data directly based on predefined rules or models to systematically test the ability of software to handle various structured and unstructured inputs. Among mutation-based fuzz testing methods, code coverage-based fuzz testing methods dominate [9].

In mutation-based fuzz testing, AFL [10] is a fast mutation-based fuzz testing tool that uses program analysis techniques and heuristic algorithms to automatically generate test cases with high coverage to discover vulnerabilities in software programs, but its mutation is random. AFLFast [11] is an improved version of AFL that uses CPU caching, process-based mutation, and other acceleration techniques to enhance the efficiency of fuzz testing. However, this method still does not solve the problem of random mutation. VUzzer [12] is fuzz testing tool that combines random mutation and symbol execution techniques to generate high-quality test cases and discover vulnerabilities in software programs. However, it uses symbol execution techniques, resulting in high-performance overhead for this method.

In code coverage-based fuzz testing, Angora [13] uses a new symbol execution engine to automatically generate test cases with high coverage and discover vulnerabilities in software programs. Selectfuzz [14] is the latest statistical learning-based fuzz testing tool that uses a new mutation method to generate test cases and discover vulnerabilities in software programs. However, both Angora and Selectfuzz spend considerable time exploring invalid paths, ignoring critical paths that may have vulnerabilities. Steelix [15] is an efficient fuzz testing tool that focuses on comparison operations within binary programs through instrumentation. It utilizes the concept of “magic bytes” to locate paths protected by “magic bytes” and generates inputs that can trigger specific paths, thereby improving the testing efficiency and vulnerability detection rate. However, the specific path triggered by this method is not a potentially vulnerable path but rather a path that traditional fuzzers find difficult to reach, so fundamentally, it is still further improving code coverage.

In addition, most fuzz testing mutators are fixed and do not focus on mutators that can be dynamically adjusted. ClassFuzz [16] is a fuzz testing tool based on the Metropolis–Hastings algorithm [17], which increases the probability of selecting mutation operations that explore more new states. However, this method only selects different mutation operations based on probability and does not optimize the mutator. MOPT [18] introduces a fuzz testing method utilizing the PSO algorithm. This method simulates the process of mutation selection through PSO, gradually searching for higher probabilities of different mutation operations being selected, thereby constructing a probability distribution for selecting mutation operators. Subsequent fuzz testing activities will select mutation operations based on the probability distribution. The above methods only select different mutation operators based on probability algorithms and do not optimize the mutator. AFLChurn [19] proposed a fuzz testing method based on the ant colony optimization (ACO) algorithm. This method revealed the superiority and inferiority of changing bytes. If the change in bytes improves the quality of seeds, the score of that byte will increase, and vice versa. With the continuous use of fuzzers, the score for each byte gradually decreases over time. Therefore, AFLChurn prioritizes bytes with higher scores. However, this method does not use byte sets to guide the evolution of the mutator.

The reinforcement learning fuzzer trains a mutator through reinforcement learning algorithms to solve the seed scheduling problem and improve testing performance. However, although the trained reinforcement learning mutator still uses a fixed mutation strategy, it is possible to train a mutator with dynamically adjusted mutation strategies through reinforcement learning algorithms. RiverFuzzRL [20] is a fuzzer based on reinforcement learning. RiverFuzzRL focuses on binary files that do not have access to source code; it allows customization in multiple dimensions, such as state representation, actions, rewards, model selection, and training algorithms. The main contributions of this tool include the combination of forward tracking and symbol constraint propagation, the ability to test stateful programs, and the ability to evaluate different training models. RiverFuzzRL fills the gap in open-source tools based on reinforcement learning for fuzz testing. However, this method has limited adaptability to specific types of software testing and may require more complex configurations and adjustments in practical applications. BertRLfuzz [21] is a fuzz testing tool that combines the bidirectional encoder representation converter (BERT) model and reinforcement learning. It automatically generates attack vectors targeting applications by learning syntax representations from seed inputs and fine-tuning them through a reinforcement learning fuzzing loop. This method enables the fuzzer to efficiently generate attack vectors that comply with syntax without the need to manually write syntax for attack patterns. It outperforms traditional black box–based fuzzers, white box–based fuzzers, and machine learning–based fuzzers in detecting security vulnerabilities. However, this method relies on the quality and representativeness of the input seeds and has a high false alarm rate when the initial seed input is incomplete or lacks certain types of attack patterns. For situations where the initial seed input is incomplete or lacks certain types of attack patterns, this tool may miss some vulnerabilities.

3. Overview of the Proposed Method

The framework of the fuzzing method based on a mutator is illustrated in Figure 1. First, the input of the fuzzer is the executable file or source code and test cases of the target program. Second, the fuzzer’s “monitor” component gathers specific runtime data through the fuzzing, which guides the mutation of test cases. This information is forwarded to the “mutator” component. The mutator then generates and selects test cases according to its mutation strategy, sending these selected cases back to the “monitor” component for input into the target program. The iterative process continues until fuzz testing is concluded. During fuzz testing, the fuzzer also reports information such as execution time, the count of program crashes, and execution speed. The continuous process of mutation and monitoring is called a fuzzing loop, which records test cases that lead to crashes in the target program during this process. Eventually, the stored crash data are analyzed to uncover vulnerabilities [22].

[figure(s) omitted; refer to PDF]

The fuzzing method based on mutators has the problems of randomly selecting mutation positions and damaging file structures, leading to test cases being rejected in the early stages of fuzz testing. Moreover, this method consumes a large amount of resources to explore invalid code paths, resulting in insufficient exploration of critical paths. In addition, almost all mutation-based fuzzers use fixed mutators during testing. The fuzzer predesigns some mutators based on the target application features, and these mutators do not change their mutation strategy during the fuzz testing period [23]. In response to the above issues, this paper proposes a directed fuzz testing method based on a feedback mechanism mutator, as shown in Figure 2. The method consists of three parts: directed fuzz testing preprocessing, a reinforcement learning fuzzing loop, and a feedback mechanism. The functional design of each part is as follows:

1. Directional fuzz testing preprocessing: This method preprocesses the fuzzing loop module of traditional mutation-based fuzz testing methods in two main steps: target acquisition and stain tracking. The target acquisition part uses a disinfector-based instrumentation algorithm to stake in the location of potential vulnerabilities in the program. The input is a C/C++ file, and the output is an executable program with suspicious vulnerabilities. The stain tracking part first executes multiple valid sample inputs on the executable program generated in the previous stage and then uses the stain tracking algorithm to associate the code block being instrumented with the bytes in the test case that affect the program’s execution to the instrumentation position. Finally, it outputs a high-value byte set to help directed fuzz testing better hit the vulnerability target.

2. Reinforcement learning fuzzing loop: The inputs to the reinforcement learning fuzzing loop are executable programs with suspicious vulnerabilities, test cases, and high-value byte sets. The reinforcement learning mutator first transforms the testing process into a Markov chain based on the characteristics of fuzz testing and abstracts the scheduling problem of its mutator into a multi-armed bandit machine problem. Then, the UCB algorithm is used to train the reinforcement learning mutator. Compared with the traditional method of fuzzing loops, this method can interact better with the program and mutate test cases that can trigger vulnerabilities more effectively.

3. Feedback mechanism: After the reinforcement learning mutator converges, the feedback mechanism is activated. First, the Docfeedback algorithm is used to filter out the test cases with the most hit vulnerability targets mutated by the reinforcement learning mutator. Then, it is used as the input for stain tracking to perform the stain analysis again, expanding the previous byte set and adding new mutation positions to the reinforcement learning mutator so that the reinforcement learning mutator can restart training for further optimization. As a result, DocFuzz will increasingly understand the essence of vulnerabilities and trigger them more comprehensively.

[figure(s) omitted; refer to PDF]

4. Preprocessing for Directed Fuzzing

The input of software generally has the characteristic of high structuring, and in fuzz testing, the input format is crucial for generating high-quality input. The well-formatted input is correctly parsed and processed to obtain the desired results. Traditional fuzzers generate many formatted inputs during the random mutation process, which are filtered and discarded by the robustness checker in the program, resulting in a significant waste of resources and reduced testing efficiency [24].

In this paper, preprocessing for directed fuzz testing is used to solve this problem. This preprocessing strategy for directed fuzz testing includes two main parts: target acquisition and taint tracking.

4.1. Target Acquisition

The target refers to the code path that is more likely to have vulnerabilities. The purpose of target acquisition is to guide fuzz testing to execute code with vulnerabilities. This paper uses the development of Google Sanitizers [25, 26] for target acquisition, whose function is to help discover possible vulnerabilities in the tested program and perform source code instrumentation annotation.

To identify the basic blocks of instrumentation that contain suspected vulnerabilities, this paper conducts a black box analysis of the differences between source code files compiled with and without disinfectors. First, all the basic blocks in the source file are stored in one file. Second, the original file is stacked, and the basic blocks that may have vulnerabilities are marked. Then, the source file is compared with the basic block file containing instrumentation annotations. Finally, the basic blocks that have been inserted into the pile are identified.

This paper supports the detection of multiple vulnerabilities, and the type of vulnerability detected depends on the choice of disinfectant type. ASan [27] is a type of Google Disinfectant that is mainly used to detect overflow vulnerabilities. In this section, the ASan tool is used for instrumentation operations, and the specific operations are as follows.

The ASan runtime library substitutes memory allocation functions, so that the memory allocation steps of the application are undertaken by the memory allocator of ASan. The ASan memory allocator requests memory near the heap memory it allocates to detect heap overflows and first places the freed memory in a quarantine area to detect heap memory errors such as heap-use-after-freedom.

When compiled with ASan-related compilation parameters, all memory access operations in the code are altered by the compiler. Before accessing memory, the compiler determines whether the memory that is about to be accessed is in a poisoned state. If it is poisoned, an error message is output; otherwise, the memory is accessed normally. An uninstrumented read operation is shown in Figure 3, and a read operation conducted after sanitizer instrumentation is shown in Figure 4. By comparing these two files, the differing code blocks are defined as the instrumented blocks.

[figure(s) omitted; refer to PDF]

Since the target acquisition method described in this paper does not rely on the type of sanitizer selected, different sanitizers can be used for fuzz testing according to the presence of different types of vulnerabilities.

4.2. Taint Tracking

To maintain well-formed, highly structured inputs, the solution proposed in this paper is to identify the key bytes that truly affect the execution of the program to suspicious vulnerability locations, form a byte set, and then mutate only the contents within this byte set to ensure the syntactic legality of the inputs.

The taint tracking method in this paper is inspired by the approach described in the literature [28]. First requires the tested program to execute one or more valid sample inputs. Each time it reaches the target code block, the inserted stain tracking records which bytes in the test case affect the values of the target code block and ultimately generates a valuable set of bytes in the test case. The effective byte set is implemented using a hash table to determine the type of key byte and their position in the test case.

The taint tracking path is shown in Figure 5. During the execution of the program, the tainted variables dec ⟶ width and dec ⟶ height propagate to the image_size value in clipconv 8 × 8_u8_s16_c and subsequently to the ptr parameter for computation purposes. As part of its taint tracking calculation, DocFuzz logs that the input bytes 0 × 1843 from the original Flash file influence the parameter values when calling clipconv 8x8_u8_s16_c.

[figure(s) omitted; refer to PDF]

The use of taint information allows the entire fuzzing method to operate in a fully automated manner, without the need for testers to understand the structure, content, or other a priori knowledge of the input, significantly reducing the manual costs incurred. The taint tracking module, in essence, observes the program’s handling of test case bytes to identify modifications that maintain the original file’s syntactic integrity. As a result, DocFuzz demonstrates efficacy with both simple and complex input file formats.

5. Reinforcement Learning Fuzzing Loop

In traditional fuzz testing methods, when a test is executed on a certain code path, the fuzzer based on code coverage assumes that the path has already been reached and will not further generate test cases that cover that path. However, triggering vulnerabilities is difficult, and simply executing on that path is insufficient. Therefore, this paper introduces vulnerability location coverage as a metric, replacing the original code coverage with the coverage of suspicious vulnerability code paths obtained during the target acquisition stage. This metric is beneficial for fuzz testing to mutate more test cases that can hit suspicious vulnerability code paths, thereby exposing vulnerabilities at the target location.

This section proposes a reinforcement learning fuzzing loop that can effectively discover vulnerability features based on feedback information during the testing process to find better mutation methods to trigger vulnerabilities. This chapter introduces two parts: the design of reinforcement learning mutator and the training of reinforcement learning mutator.

5.1. Design of the Reinforcement Learning Mutator

The core challenge of fuzz testing mutators is how to mutate input data that can trigger vulnerabilities. This paper proposes that this problem can be abstracted as a balance problem between when to find the optimal mutation strategy during the mutation process and when to apply the current most effective mutation method. Therefore, in this paper, the problem is treated as a typical exploration–exploitation problem [29] in reinforcement learning by using the approach in reinforcement learning to design a reinforcement learning mutator.

The multi-armed bandit problem is a classic example of an exploration–exploitation problem [30], where exploration refers to trying to pull a multi-armed bandit machine that has not yet been pulled to gain more information, and exploitation involves choosing a multi-armed bandit machine with a known higher return to maximize profits. This characteristic aligns with the problem that fuzz testing aims to solve. The fuzzing problem also involves a trade-off between when to explore how to mutate the better input and when to apply the current best mutation method.

This paper formalizes the fuzz testing as a reinforcement learning problem and applies the algorithm of multi-armed bandit machines to solve the difficulties of the fuzz testing mutator. For the improved multi-armed bandit machine problem, the definition of action is to choose one mutation method from multiple different mutation methods. Action is aimed at the mutation of characters composed of valid byte sets, and after the mutation is completed, a new test case is assembled based on the position information of the valid byte set. The state is the mutated test case, while the reward is the indicator of the number of basic blocks triggered to be plugged.

5.1.1. States

The model considers the state observed by the agent to be the test case. Consider a finite set of symbols denoted by $\sum$ . All of possible test cases composed of this set are represented by the Kleene closure. For an input string [31], let $x = x_{1} \dots, x_{n}$ . $\begin{matrix} (1) & S x ≔ x_{1 + i}, \dots, x_{m + i} i \geq 0, m + i \leq n, \end{matrix}$ where S(x) represents the set of all substrings of x. The state of a Markov decision is defined as I. Here, x ∈ I represents the input test cases, while $x^{'}$ ∈ S(x) ⊂ I represents the substrings of these test cases.

5.1.2. Actions

A set of possible actions X is defined for a Markov decision process as the choice of a certain variant method that maps the input substring to multiple variant methods: $\begin{matrix} (2) & X ≔ a : I ⟶ I \times I, N, P a \sim π x^{'}, \end{matrix}$ where the algebra of the sample space (I × I) is denoted by N = σ(I × I) and each specific variation method is selected with a certain probability. The variation methods constitute a subset of the probable string rewriting rules defined to work on the provided seed input. These well-defined rewriting rules are the arms of the multi-armed bandit problem. The rules are inspired by the approach described in the literature [31].

5.1.3. Reward

We define the reward independently for (1) the next executed action a and (2) the input x generated for the next executed program, i.e., $r x, a$ . The model tries to use E to provide the count of newly found blocks, the length of the path, and the execution time for processing the input x to reach the target. We define the count of basic blocks of the executed target as $E_{1} x, I^{'}$ : $\begin{matrix} (3) & E_{1} x, I^{'} ≔ B c_{x} \ \underset{χ \in I^{'}}{\cup} B c_{χ} . \end{matrix}$

In the context of fuzz testing, $c_{x}$ represent the execution path taken by the program when executing test case x, and $B c_{x}$ represent the basic blocks executed along the path. Additionally, let $I^{'} \subset I$ denote the subset of test cases that have been processed previously.

For the modified multi-armed bandit machine problem, different mutation methods are defined as multiple arms on the multi-armed bandit machine, which means that the action is to choose a mutation method, the state is the mutated input, and the reward is an indicator of the count of target basic blocks touched; the final goal is to cover more target bases for executing toward the target basic block and testing it.

5.2. Training Process of the Reinforcement Learning Mutator

In this paper, the reinforcement learning mutator is abstracted into a multi-armed bandit problem, and algorithms for multi-armed bandits are used in the training process of the reinforcement learning mutator.

Many highly effective algorithms are available for multi-armed bandits, such as the naive greedy algorithm, epsilon-greedy algorithm, UCB algorithm [32], and Thompson sampling algorithm [33]. To effectively solve the fuzzing problem using multi-armed bandit algorithms, this paper designs comparative experiments involving the above algorithms. Through these experiments, the most suitable algorithm for addressing the problem at hand is determined.

This paper applies several algorithms to a well-defined reinforcement learning mutator for 1000 rounds of fuzzing, observing the average reward convergence speed of each algorithm and analyzing which algorithm is most suitable for this type of fuzzing.

Figure 6 shows that both the ordinary greedy algorithm and the epsilon-greedy algorithm converge quickly. For fuzzing, quickly selecting a mutation method to continue the mutation process goes against the core of the exploration and exploitation problem, losing the meaning of the testing strategy. Comparing the UCB algorithm and the Thompson sampling algorithm, the UCB algorithm is more balanced in terms of exploration and exploitation. In the long term, it can obtain more rewards and trigger more vulnerabilities.

[figure(s) omitted; refer to PDF]

Therefore, for the training process of the reinforcement learning mutator proposed in this paper, the UCB algorithm is chosen. The advantages of this algorithm are that it is simple to implement and that it has good theoretical performance guarantees.

The core idea of the UCB algorithm is to maintain a confidence interval for each arm, where the upper limit of the confidence interval is determined by the historical average of the rewards and the level of confidence. In each operation, the UCB algorithm chooses the arm with the highest upper confidence bound, aiming to obtain the maximum reward. As the number of operations increases, the UCB algorithm gradually determines the true reward distribution of each arm by continuously updating the confidence intervals, thereby gradually optimizing the operation strategy. First, the return on investment after confidence is reached is defined as follows: $\begin{matrix} (4) & {\hat{μ}}_{i, t} = \frac{\sum_{s = 1 : I_{s} = i^{γ_{s}}}^{t}}{n_{i, t}}, \end{matrix}$ where $γ_{s}$ is the reward observed at time s and $μ_{i, t}$ represents the count which arm i has been chosen up to the current time t. The new reward refers to the proportion of times selecting the arm results in a reward out of the total number of selections for that arm. This can also be described as the average reward value. Subsequently, a final value is assigned to each arm: $\begin{matrix} (5) & U C B_{i, t} ≔ {\hat{μ}}_{i, t} + \sqrt{\frac{\ln t}{n_{i, t}}}, \end{matrix}$ where t is the number of times that the arm is shaken and $μ_{i, t}$ is the number of times arm i is shaken by time t. The latter part of the formula represents the confidence level, which is used to express the confidence of the estimate in the first part. The process of the UCB algorithm is shown in Algorithm 1.

Algorithm 1: The UCB algorithm of DocFuzz.

Input: High-value byte set

Output: Mutated test cases

Initialize: For each arm $i$ , set the average reward to 0 and the number of selections to 0. Perform one selection operation for each arm i.

1. for t in range(num_time_steps):

2. for i in range(num_selections_per_step):

3. Calculate the upper confidence bound for each arm i:

upper_bound_i = average_reward_i + sqrt((2 $*$ ln(t $*$ num_selections_per_step + i + 1))/selectcount_i)

4. Select the arm with the maximum upper bound for its confidence interval: chosen_arm = argmax_i(upper_bound_i)

5. Execute the selection operation and obtain the reward.

6. Update the average reward and selection count for the chosen arm:

select_count[chosen_arm] + = 1

average_reward[chosen_arm]+=(reward-avearge_reward[chosen_arm])/select_count[chosen_arm]

7. end for

8. end for

6. Feedback Mechanism

Traditional fuzz testing methods use fixed mutators during testing, which cannot change the mutation strategy during testing and therefore cannot better trigger vulnerabilities with mutated inputs. In addition, the UCB algorithm used by the reinforcement learning mutator can gradually discover potential high-return actions. However, due to the exploratory nature of algorithms and their ability to handle uncertainty, it is not guaranteed to find the global best action, especially in the initial stage or when information is limited. Therefore, in the process of selecting mutation methods, the reinforcement learning mutator has the following problems due to the limited number of valuable bytes in the initial stage: (1) The number of triggered vulnerabilities is relatively small, and (2) the mutator quickly converges to a mutation method, thus stopping exploration.

To address the aforementioned issues, the Docfeedback algorithm is presented. In response to the first issue, the algorithm selects the top five test cases that hit the target the most as inputs for another stain tracking operation by comparing the efficiency of stain tracking to expand the previous valuable byte set. Regarding the second issue, the feedback mechanism uses the expanded set of valuable bytes as the mutation location for a new round of reinforcement learning mutators, restarts the reinforcement learning mutator, and better tests deeper code regions and learns vulnerability features, achieving better vulnerability mining capabilities. The feedback mechanism can optimize the mutator, allowing it to mutate more test cases that reach the target code block.

6.1. Initiation of the Feedback Mechanism

First, we introduce the conditions that trigger the feedback mechanism. When the reinforcement learning algorithm converges, the feedback mechanism is activated, and this paper elaborates on the conditions for judging the convergence of reinforcement learning. Judging the convergence of reinforcement learning requires a comprehensive consideration of the confidence interval width of each action, the difference in benefits, and the stability of action selection.

The threshold of the confidence interval width determines the level of confidence in the value estimation of each action. In an ideal scenario, this threshold should be small enough to ensure the accurate estimation of actions but not too small to prevent the algorithm from prematurely stopping exploration. The formula is as follows: $\begin{matrix} (6) & C W_{i} = \sqrt{\frac{2 \ln n}{n_{i}},} \\ (7) & \max_{i} C W_{i} \leq θ_{C W}, \end{matrix}$ where $n_{i}$ is the number of times an action has been selected. ${\bar{x}}_{i}$ is the average return of the action. $C W_{i}$ is the confidence interval width of the action. The maximum confidence interval width of the action is represented by $\max_{i} C W_{i}$ . $θ_{C W}$ is the threshold for the confidence interval width. In this article, the value of $θ_{C W}$ is set to 0.2, which means that when the maximum confidence interval width is less than 0.2, the action selection is relatively stable and does not change much.

The threshold for profit difference is used to measure the maximum profit difference between different actions. A smaller value means that the average profit of all actions is close and that there is no significant optimal action. $\begin{matrix} (8) & \max_{i} {\bar{x}}_{i} - \min_{i} {\bar{x}}_{i} \leq θ_{R D} . \end{matrix}$

Among them, $\min_{i} {\bar{x}}_{i}$ is the highest average return, $\min_{i} {\bar{x}}_{i}$ is the lowest average return, and $θ_{R D}$ is the threshold for the difference in returns. This article sets the value of $θ_{R D}$ to 0.1, which means that when the maximum average return difference of all actions is less than 0.1, this indicates that no single mutation method is significantly better than other methods. Therefore, it can be considered that the algorithm has reached a balance among multiple mutation methods.

The stability of action selection is measured by the count of consecutive selections of the same mutation method. A higher value means that the algorithm consistently chooses the same action for a longer period, thus exhibiting a high degree of decision consistency. S is the count of times the same action is chosen consecutively, and this article designs an algorithm to obtain S, as shown in Algorithm 2: Check Stack Action Selection.

Algorithm 2: Check stable action selection.

Input: A list of actions

Output: A boolean indicating if there is a stable action selection

Initialize: Set current_action to the first action in the list and count to 1.

1. if length of the list of actions is less than the stability threshold S

2. return False

3. for each action in the list from the second element to the last

4. if action equals current_action

5. increment count by 1

6. if count ≥ S

7. return True//Stability condition met

8. else

9. set current_action to action

10. set count to 1

11. return False//No stable selection found after processing all actions

First, we determine whether the length of the action list is less than the threshold S for consecutive selections; if so, we return False directly. Then, the algorithm traverses the action list, and for each action, it checks whether it is the same as the previously selected action. If they are the same, the counter increases; if the counter reaches or exceeds S, it indicates that a stable selection that meets the condition has been found and returns True. If the current action is different from the previous action, the current action is reset, and the counter is reset to 1. When traversing the entire list, if no conditions are found that meet the criteria, false data are returned. This algorithm determines the stability of action selection by continuously checking the repeatability of actions.

Finally, the three key elements mentioned above are used to determine whether the reinforcement learning mutator is convergent, and the formula is as follows: $\begin{matrix} (9) & Converged = \max_{i} C W_{i} \leq θ_{C W} \land \max_{i} {\bar{x}}_{i} - \min_{i} {\bar{x}}_{i} \leq θ_{R D} \land S times stable choice . \end{matrix}$

When the confidence interval width and profit difference of each action are less than the threshold and the same mutation method is continuously selected a specific number of times, the mutator converges and starts to activate the feedback mechanism.

6.2. Docfeedback Algorithm of the Feedback Mechanism

The core principle of the feedback mechanism is that the valuable byte set obtained from stain tracking is directly related to the location of potential vulnerabilities. Therefore, mutating these bytes is most likely to trigger vulnerabilities, as shown in Figure 7. In the feedback mechanism, new test cases are used for stain tracking to find more valuable bytes and provide them to the reinforcement learning mutator for mutation, thereby enabling the fuzzer to explore vulnerabilities.

[figure(s) omitted; refer to PDF]

Algorithm 3 is the Docfeedback algorithm. First, using the byte set closely related to the suspicious vulnerability location obtained in the preprocessing step, in the subsequent reinforcement learning mutator, only valuable byte sets are mutated to ensure the syntactic and semantic correctness of the input obtained through mutation. Reinforcement learning is used to mutate and generate test cases that trigger vulnerabilities.

Then, after a round of reinforcement learning convergence, the test cases generated during reinforcement learning convergence are used as inputs for stain tracking. In this paper, by comparing the efficiency of stain tracking on programs in the LAVA-M dataset, the top five test cases that hit the most targets are ultimately selected as inputs for another stain tracking operation to update the previously valuable byte set. The use of stain tracking and reinforcement learning mutually promotes each other to better test deeper code regions and learn the characteristics of vulnerabilities.

Algorithm 3: Docfeedback algorithm.

Input: High-value byte set

Output: Updated high-value byte set

Initialize: DocFuzz determines that reinforcement learning has converged.

1. Select the top five test cases that hit the most targets:

taint_results.sort(key = lambda x: x[“hits”], reverse = True)

top_five_cases = taint_results[:5]

return top_five_cases

2. Use the selected top five test cases to update the valuable byte sets:

updated_bytes = []

for case in top_five_cases:

updated_bytes.extend(taint tracking(case))

return updated_bytes

3. Update the valuable byte set:

update_valuable_bytes.extend(updated_bytes)

7. Experiments and Results

7.1. Experimental Datasets

To verify the detection performance of DocFuzz, it is evaluated across three distinct types of datasets: (1) two real-world programs, (2) the LAVA-M dataset [34], and (3) the DARPA CGC dataset [35].

The above datasets are commonly used test sets in the field of fuzz testing. The two employed real-world programs are the Swfdec Adobe Flash Player (version 0.5.5) and the MuPDF PDF viewer (version 0.1). In the experiments, these two programs are used to validate the legal passing rate of the inputs processed by the taint tracking module.

The LAVA-M dataset is a subset of the LAVA dataset that consists of four GNU coreutils programs—base64, md5sum, uniq, and who—each of which is injected with 44, 57, 28, and 2136 bugs, respectively [34]. All errors are protected by a four-byte magic number comparison, which triggers a bug only when the preset condition is met. The performances of DocFuzz and the currently available classic fuzzers are compared via detection experiments conducted on this dataset [34].

The DARPA CGC dataset is a computer security dataset funded by the Defense Advanced Research Projects Agency (DARPA) to promote the development of automated vulnerability discovery and repair technologies [35]. This dataset consists of many binary programs from the real world, including operating systems, services, and applications, totaling 60 programs. In this paper, we select eight representative programs from the dataset to demonstrate the comprehensive capabilities and universality of DocFuzz [35].

7.2. Taint Tracking Effectiveness

Verification Experiment: To verify the effectiveness of the taint tracking module, traditional fuzzing methods and the taint tracking module are used to process input files and induce random mutations in the files, and the legal pass rates of the inputs processed by different methods are compared.

First, AFL and DocFuzz are used to mutate the input files, resulting in 1000 test cases each. Then, two open-source C language applications, the Swfdec Adobe Flash Player (version 0.5.5) and the MuPDF PDF viewer (version 0.1) [28], are used to evaluate the above 2000 test cases. The evaluation results are shown in Table 1.

Table 1

Comparison of pass rates between after taint tracking and randomly mutated inputs.

Program	Pass rate of randomly mutated inputs (%)	Input pass rate after taint tracking (%)
MuPDF	57	87
Swfdec	64	90

As shown in Table 1, the taint tracking module of DocFuzz significantly enhances the valid pass rate of the test cases; this ensures that the mutated inputs are not rejected by the program in the initial stages of testing, allowing for deeper program path exploration. Consequently, the fuzzer is better equipped to learn vulnerability characteristics, facilitating the training of a reinforcement learning mutator that is capable of dynamic adjustment, thereby yielding improved testing performance.

7.3. Feedback Mechanism of the Reinforcement Learning Mutator

7.3.1. Selection of the Number of Test Cases for the Feedback Mechanism

In this paper, we select programs from the LAVA-M test set as experimental subjects, and fuzz testing is conducted on these programs using varying numbers of test cases ranging from 1 to 10. For example, when using three cases, after one round of reinforcement learning converged, the three test cases that hit the most targets are selected as taint tracking inputs. This experiment is conducted with 100 rounds of reinforcement learning under different numbers of test cases, and the average duration of taint tracking and the average number of expanded effective byte sets are recorded for these 100 rounds. The efficiency of taint tracking is used as the evaluation metric for selecting the number of test cases in the adaptive mechanism. Here, taint tracking efficiency is defined as the average number of expanded effective bytes divided by the average total taint tracking duration.

According to Table 2, when the number of test cases is 5, the taint tracking efficiency is 0.73, meaning that on average, 0.73 effective bytes are found per second through taint tracking; this is the most effective result. Therefore, in this paper, five test cases are selected for the adaptive mechanism.

Table 2

Results produced with different numbers of test cases for the LAVA-M test set.

Number of test cases	Average duration of taint tracking (s)	Average number of expanded effective byte sets per round	Taint tracking efficiency (items/s)
1	5.0	2	0.4
2	11.3	3	0.26
3	15.6	8	0.51
4	20.3	11	0.54
5	24.6	18	0.73
6	30.2	19	0.62
7	35.6	20	0.56
8	41.0	20	0.48
9	44.5	21	0.47
10	49.9	21	0.42

7.3.2. Effectiveness of the Feedback Mechanism

As shown in Figure 8, during the 24-h test implemented on the LAVA-M dataset, DocFuzz undergoes three rounds of reinforcement learning until convergence at approximately 4.5–14 h and 21–23 h, during which the reinforcement learning mutator ceases its optimization process. However, due to the adaptive mechanism, new inputs are used for another round of taint tracking, thus expanding the byte set. The reinforcement learning process is restarted, and deeper issues are explored; this demonstrates that the adaptive mechanism could enable reinforcement learning to enter the next round of training once it finds what it deems the best mutation method, continuing the search for better mutation operations.

[figure(s) omitted; refer to PDF]

7.4. Vulnerability Discovery Capability Experiment

Verification experiments are conducted on both the LAVA-M and DARPA CGC datasets to assess DocFuzz’s capability in discovering vulnerabilities. DocFuzz is compared with state-of-the-art fuzzing tools, including AFL [10], AFLFast [11], VUzzer [12], Angora [13], SelectFuzz [14], and Steelix [15], and the experimental results are shown in Table 3. AFL [10] is a fast, mutation-based fuzzing tool that uses heuristic algorithms to automatically generate test cases with high coverage to discover vulnerabilities in software applications. AFLFast [11] is an improved version of AFL that utilizes acceleration technologies such as CPU cache utilization and process-based mutation to enhance the efficiency of fuzzing. VUzzer [12] is a mutation-based fuzzing tool that combines random mutation with symbolic execution techniques to generate high-quality test cases and discover vulnerabilities in software applications. Angora [13] is a fuzzing tool based on symbolic execution that uses a new symbolic execution engine to automatically generate high-coverage test cases and discover vulnerabilities in software applications. SelectFuzz [14] is the latest fuzzer and is a statistical learning-based fuzzing tool that uses a novel mutation method to generate high-coverage test cases and discover vulnerabilities in software applications. Steelix [15] is a model-based fuzzing tool that employs model checking techniques and statistical learning methods to generate test cases with high coverage and discover vulnerabilities in software applications.

Table 3

Vulnerability discovery capabilities exhibited by various fuzzing tools on the LAVA-M dataset.

Program	base64	md5sum	uniq	Who
TotalBugs	44	57	28	2136
AFL	7	2	7	0
AFLFast	9	0	0	18
VUzzer	17	1	22	50
Steelix	43	28	24	189
Angora	40	57	24	1010
SelectFuzz	44	57	28	1199
BertRLFuzzer	34	45	25	379
RiverFuzzRL	40	42	28	350
DocFuzz (ours)	44	57	28	1252

Note: The significance of the bold is to highlight the experimental effect of the method presented in this paper.

As shown in Table 3, compared with other detection methods, DocFuzz can detect more vulnerabilities. Specifically, it detects all base64, md5sum, and unique vulnerabilities and identifies the greatest number of vulnerabilities.

To measure the ability of DocFuzz to discover overflow vulnerabilities, we conduct overflow vulnerability detection experiments on the LAVA-M dataset using DocFuzz and famous fuzzing tools. The experimental results are presented in Table 4.

Table 4

Overflow vulnerability discovery capabilities exhibited by various fuzzing tools on the LAVA-M dataset.

Program	base64	md5sum	uniq	Who
OverflowBugs	35	40	20	1689
AFL	2	1	5	0
AFLFast	4	0	0	7
VUzzer	6	0	9	10
Steelix	23	18	14	29
Angora	30	38	26	456
SelectFuzz	34	39	20	678
BertRLFuzzer	31	26	9	34
RiverFuzzRL	28	33	14	41
DocFuzz (ours)	35	40	20	952

Note: The significance of the bold is to highlight the experimental effect of the method presented in this paper.

As Table 4 indicates, DocFuzz can detect more overflow vulnerabilities than the other methods; this is because DocFuzz integrates reinforcement learning methods, which allows it to learn the characteristics of overflow vulnerabilities, thereby more effectively and rapidly mutating inputs that can trigger overflow vulnerabilities.

To further validate the vulnerability discovery ability of DocFuzz, verification experiments are also conducted on the DARPA CGC dataset. The results are shown in Table 5.

Table 5

Vulnerability discovery capabilities exhibited by various fuzzing tools on the DARPA CGC dataset.

Program	AFL	AFLGO	Steelix	SelectFuzz	BertRLFuzzer	RiverFuzzRL	DocFuzz (ours)
CADET00001	✔	✔	✔	✔	✔	✔	✔
EAGLE00005	✔	✔	✖	✔	✖	✔	✔
NRFIN00010	✔	✔	✔	✔	✖	✖	✔
YAN0100001	✖	✖	✔	✔	✔	✔	✔
YAN0100003	✔	✔	✔	✔	✖	✖	✔
KPRCA00001	✖	✖	✔	✔	✔	✔	✔
KPRCA00015	✖	✔	✖	✔	✔	✖	✔
YAN0100002	✖	✖	✖	✖	✖	✖	✔

The DARPA CGC dataset was created to simulate real-world programs with known vulnerabilities, enabling researchers to realistically evaluate their vulnerability discovery tools [35]. Multiple sanity checks implemented on the input protect each program bug. The dataset includes a set of test cases that serve as evidence of vulnerabilities [35].

On the DARPA CGC dataset, this experiment utilizes eight sample CGC binary files, each with different types of vulnerabilities and different file types, as the input test cases. Each fuzzing tool is run for 3 h.

As indicated in Table 5, DocFuzz successfully identifies all the vulnerabilities in the eight binary files, demonstrating its superior comprehensive capabilities and universality to other methods. For YAN0100002, only DocFuzz can find a bug in 3 h. This functions as a tennis ball motion calculator with a bug, which is inspired by “The Patriot Missile Failure” [36]. DocFuzz will fully explore each mutation operation and train reinforcement learning mutators based on feedback information from the binary programs. When a certain input triggers a large number of floating-point calculations, DocFuzz will continue to explore the mutation operation and ultimately discover a vulnerability that other fuzzers find difficult to trigger.

7.5. Effective Mutation and Vulnerability Discovery Efficiency

To evaluate the vulnerability discovery capability, two experiments are designed: a mutation effectiveness verification experiment and a vulnerability discovery efficiency experiment. The program in the LAVA-M test set is selected as the experimental object, and each fuzz testing method is tested in the same experimental environment for 24 h.

In the experiment for verifying vulnerability capability, the fuzzer being tested is run on the target program being tested, and the proportion of effective mutations is recorded. The effective mutation refers to when the mutated test case triggers the target area of the pile being inserted.

As shown in Figure 9(a), the x-axis represents the running time of the fuzzer, while the y-axis depicts the ratio of effective mutations to the total number of mutations. Figure 9(a) shows the trend of variation efficiency over time for different fuzzers. As shown in the figure, DocFuzz demonstrates its mutation characteristics after using the feedback mechanism in this experiment. After the first round of reinforcement learning mutation convergence at 0–6 h, there is a significant improvement in efficiency at 6–11 h and 14–21 h, respectively; this is because the feedback mechanism readjusts and activates the mutator after recognizing reinforcement learning convergence, optimizing its effective mutation ability. AFLFast, VUzzer, and Steelix initially have higher efficiency but rapidly decrease over time, which usually indicates that they quickly discover easily detectable vulnerabilities. As these vulnerabilities are fixed, the remaining vulnerabilities become more difficult to detect, leading to a decrease in efficiency and a tendency to stabilize; this indicates that it may perform poorly in this testing scenario or may require more time to adjust and optimize its testing strategy. A slow and stable downward trend is exhibited, which means that alongside the continuous discovery of new vulnerabilities, these methods may gradually encounter more complex vulnerabilities or code paths. BertlFuzzer and RiverFuzzRL exhibit a stable downward trend, which may mean that their testing strategy gradually saturates over time. The mutation efficiency of these strains is relatively low, and the change is not significant; although they also use reinforcement learning algorithms, their adaptability to specific vulnerabilities is not strong. The efficiency of SelectFuzz and Angora rapidly decreases over time, indicating that they may have identified some vulnerabilities in the early stages but later found it difficult to continuously discover new vulnerabilities, possibly because the strategy was effective in the early stages but not suitable for the long-term detection of deep vulnerabilities.

[figure(s) omitted; refer to PDF]

The experimental results of vulnerability discovery efficiency are shown in Figure 9(b). It can be seen from the figure that DocFuzz finds significantly more vulnerabilities within 24 h than the other fuzz testing methods.

In the first few hours of the experiment, the growth curve of DocFuzz does not show a rapid upward trend compared to that of the other tested methods. However, as time goes on, DocFuzz discovers vulnerabilities that other methods do not find. This result means that DocFuzz is more capable of quickly penetrating the complex parts of the program in design and exploring in the direction of more triggering vulnerabilities. According to Figure 9(b), compared with the fuzz testing method based on reinforcement learning, DocFuzz restarts the reinforcement learning compiler through a feedback mechanism after the reinforcement learning converges, while BertRLfuzz and RiverFuzzRL stop testing prematurely.

8. Conclusion

To solve the problem concerning the use of invariable mutators in traditional fuzzing methods, this paper proposes a directed fuzzing method based on a feedback mechanism mutator, DocFuzz. This method is sensitive to the locations of vulnerabilities while ensuring the syntactic and semantic validity of the generated input. DocFuzz trains a reinforcement learning mutator to better understand the basic principles of vulnerabilities and uses the learned strategies to optimize the mutator, thus enabling the fuzzer to dynamically adjust during the testing process and iterate in the direction that can trigger more vulnerabilities; this allows the mutated test cases to not only reach the locations of vulnerabilities but also quickly trigger vulnerabilities. The experimental results show that DocFuzz can detect specific vulnerabilities more accurately and effectively than other approaches.

In future research, we aim to design more efficient reinforcement learning algorithms to further enhance the efficiency and capabilities of vulnerability detection methods. Additionally, we plan to improve the vulnerability detection performance of the proposed approach by leveraging the feedback mechanism mutator of fuzzers.

Author Contributions

Lixia Xie: conceptualization, methodology, writing–original draft. Yuheng Zhao: methodology, software, writing–original draft, validation. Hongyu Yang: administration, methodology, writing–review and editing. Ziwen Zhao: conceptualization, funding acquisition. Ze Hu: supervision. Liang Zhang: checking, supervision. Xiang Cheng: investigation, funding acquisition.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. U1833107), the Youth Fund Project of Jiangsu Provincial Basic Research Program (Grant No. BK20230558), and the Open Fund of the Civil Aviation Flight Networking Key Laboratory of Civil Aviation University of China (Grant No. MHFLW202304).

Acknowledgments

References

[1] J. Chen, J. Chen, D. Guo, D. Towey, "An Improved Fuzzing Approach Based on Adaptive Random Testing," Proceedings of 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW 2020), pp. 103-108, DOI: 10.1109/issrew51248.2020.00045, .

[2] C. Daniele, S. B. Andarzian, E. Poll, "Fuzzers for Stateful Systems: Survey and Research Directions," ACM Computing Surveys, vol. 56 no. 9,DOI: 10.1145/3648468, 2024.

[3] X. Zhao, H. Qu, J. Xu, X. Li, W. Lv, G. Wang, "A Systematic Review of Fuzzing," Soft Computing, vol. 28 no. 6, pp. 5493-5522, DOI: 10.1007/s00500-023-09306-2, 2024.

[4] S. Mallissery, Y. Wu, "Demystify the Fuzzing Methods: A Comprehensive Survey," ACM Computing Surveys, vol. 56 no. 3,DOI: 10.1145/3623375, 2023.

[5] B. Wu, F. Zou, "Code Vulnerability Detection Based on Deep Sequence and Graph Models: A Survey," Security and Communication Networks, vol. 2022,DOI: 10.1155/2022/1176898, 2022.

[6] X. Zhu, S. Wen, S. Camtepe, Y. Xiang, "Fuzzing: A Survey for Roadmap," ACM Computing Surveys, vol. 54 no. 11s,DOI: 10.1145/3512345, 2022.

[7] G. Choi, S. Jeon, J. Cho, J. Moon, "A Seed Scheduling Method with a Reinforcement Learning for a Coverage Guided Fuzzing," IEEE Access, vol. 11, pp. 2048-2057, DOI: 10.1109/access.2022.3233875, 2023.

[8] M. Boehme, C. Cadar, A. Roychoudhury, "Fuzzing: Challenges and Reflections," IEEE Software, vol. 38 no. 3, pp. 79-86, DOI: 10.1109/ms.2020.3016773, 2021.

[9] C. Beaman, M. Redbourne, J. D. Mummery, S. Hakak, "Fuzzing Vulnerability Discovery Techniques: Survey, Challenges and Future Directions," Computers & Security, vol. 120,DOI: 10.1016/j.cose.2022.102813, 2022.

[10] M. Zalewski, American Fuzz Lop, 2024.

[11] M. Böhme, V. Pham, A. Roychoudhury, "Coverage-Based Greybox Fuzzing as Markov Chain," Proceedings of 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16), pp. 1032-1043, .

[12] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, H. Bos, "VUzzer: Application-Aware Evolutionary Fuzzing," Proceedings of 24th Annual Network and Distributed System Security Symposium (NDSS 2017), .

[13] P. Chen, H. Chen, "Angora: Efficient Fuzzing by Principled Search," Proceedings of 2018 IEEE Symposium on Security and Privacy (SP 2018), pp. 711-725, DOI: 10.1109/sp.2018.00046, .

[14] C. Luo, W. Meng, P. Li, "SelectFuzz: Efficient Directed Fuzzing with Selective Path Exploration," Proceedings of 2023 IEEE Symposium on Security and Privacy (SP 2023), pp. 2693-2707, DOI: 10.1109/sp46215.2023.10179296, .

[15] Y. Li, B. Chen, M. Chandramohan, S. Lin, Y. Liu, A. Tiu, "Steelix: Program-State Based Binary Fuzzing," Proceedings of 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2017), pp. 627-637, DOI: 10.1145/3106237.3106295, .

[16] L. Bernhard, T. Scharnowski, M. Schloegel, T. Blazytko, T. Holz, "JIT-picking: Differential Fuzzing of JavaScript Engines," Proceedings of 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS’22), pp. 351-364, .

[17] S. Adams, T. Cody, P. A. Beling, "A Survey of Inverse Reinforcement Learning," Artificial Intelligence Review, vol. 55 no. 6, pp. 4307-4346, DOI: 10.1007/s10462-021-10108-x, 2022.

[18] C. Lyu, S. Ji, C. Zhang, "MOPT: Optimized Mutation Scheduling for Fuzzers," Proceedings of 28th USENIX Security Symposium (USENIX Security 2019), pp. 1949-1966, .

[19] X. Zhu, M. Böhme, "Regression Greybox Fuzzing," Proceedings of 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS’21), pp. 2169-2182, DOI: 10.1145/3460120.3484596, .

[20] C. Paduraru, M. Paduraru, A. Stefanescu, "RiverFuzzRL-an Open-Source Tool to Experiment with Reinforcement Learning for Fuzzing," Proceedings of 14th IEEE Conference on Software Testing, Verification and Validation (ICST 2021), pp. 430-435, .

[21] P. Jha, J. Scott, J. S. Ganeshna, M. Singh, V. Ganesh, "BertRLFuzzer: A BERT and Reinforcement Learning Based Fuzzer (Student Abstract)," vol. 38 no. 21, pp. 23521-23522, DOI: 10.1609/aaai.v38i21.30455, .

[22] H. Liang, X. Pei, X. Jia, W. Shen, J. Zhang, "Fuzzing: State of the Art," IEEE Transactions on Reliability, vol. 67 no. 3, pp. 1199-1218, DOI: 10.1109/tr.2018.2834476, 2018.

[23] P. Wang, X. Zhou, T. Yue, P. Lin, Y. Liu, K. Lu, "The Progress, Challenges, and Perspectives of Directed Greybox Fuzzing," Software Testing, Verification and Reliability, vol. 34 no. 2,DOI: 10.1002/stvr.1869, 2024.

[24] C. Aschermann, S. Schumilo, T. Blazytko, R. Gawlik, T. Holz, "REDQUEEN: Fuzzing with Input-To-State Correspondence," Proceedings of 26th Annual Network and Distributed System Security Symposium (NDSS 2019), .

[25] D. Song, J. Lettner, P. Rajasekaran, "SoK: Sanitizing for Security," Proceedings of 2019 IEEE Symposium on Security and Privacy (SP 2019), pp. 1275-1295, DOI: 10.1109/sp.2019.00010, .

[26] A. Potapenko, "Sanitizers. GitHub. Last Modified on 23 August 2024," . https://github.com/google/sanitizers

[27] K. Serebryany, D. Bruening, A. Potapenko, D. Vyukov, "Addresssanitizer: A Fast Address Sanity Checker," Proceedings of 2012 USENIX Annual Technical Conference (USENIX ATC 2012), pp. 309-318, .

[28] V. Ganesh, T. Leek, M. Rinard, "Taint-based Directed Whitebox Fuzzing," Proceedings of 31st IEEE International Conference on Software Engineering, pp. 474-484, DOI: 10.1109/icse.2009.5070546, .

[29] A. H. Ganesh, B. Xu, "A Review of Reinforcement Learning Based Energy Management Systems for Electrified Powertrains: Progress, Challenge, and Potential Solution," Renewable and Sustainable Energy Reviews, vol. 154,DOI: 10.1016/j.rser.2021.111833, 2022.

[30] C. Zhu, J. Yang, W. Zhang, Y. Zheng, "An Algorithm for Multi-Armed Bandit Based on Variance Change Sensitivity," Engineering Research Express, vol. 6 no. 2,DOI: 10.1088/2631-8695/ad4255, 2024.

[31] K. Böttinger, P. Godefroid, R. Singh, "Deep Reinforcement Fuzzing," Proceedings of 2018 IEEE Security and Privacy Workshops (SPW 2018), pp. 116-122, DOI: 10.1109/spw.2018.00026, .

[32] T. M. Moerland, J. Broekens, A. Plaat, C. M. Jonker, "Model-Based Reinforcement Learning: A Survey," Foundations and Trends® in Machine Learning, vol. 16 no. 1,DOI: 10.1561/2200000086, 2023.

[33] Z. Zabihi, A. M. Eftekhari Moghadam, M. H. Rezvani, "Reinforcement Learning Methods for Computation Offloading: A Systematic Review," ACM Computing Surveys, vol. 56 no. 1,DOI: 10.1145/3603703, 2023.

[34] B. Dolan-Gavitt, P. Hulin, E. Kirda, "LAVA: Large-Scale Automated Vulnerability Addition," Proceedings of 2016 IEEE Symposium on Security and Privacy (SP 2016), pp. 110-121, .

[35] J. Song, J. Alves-Foss, "The DARPA Cyber Grand Challenge: A Competitor’s Perspective, Part 2," IEEE Security & Privacy, vol. 14 no. 1, pp. 76-81, DOI: 10.1109/msp.2016.14, 2016.

[36] N. Arnold, "The Patroit Missile Failure. CSE. Last Modified on 23 August 2000," , 2024. https://www.ima.umn.edu/%7Earnold/disasters/patriot.html

Word count: 8867

Show less

Copyright © 2024 Lixia Xie et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0/

Abstract

In response to the limitations of traditional fuzzing approaches that rely on static mutators and fail to dynamically adjust their test case mutations for deeper testing, resulting in the inability to generate targeted inputs to trigger vulnerabilities, this paper proposes a directed fuzzing methodology termed DocFuzz, which is predicated on a feedback mechanism mutator. Initially, a sanitizer is used to target the source code of the tested program and stake in code blocks that may have vulnerabilities. After this, a taint tracking module is used to associate the target code block with the bytes in the test case, forming a high-value byte set. Then, the reinforcement learning mutator of DocFuzz is used to mutate the high-value byte set, generating well-structured inputs that can cover the target code blocks. Finally, utilizing the feedback mechanism of DocFuzz, when the reinforcement learning mutator converges and ceases to optimize, the fuzzer is rebooted to continue mutating toward directions that are more likely to trigger vulnerabilities. Comparative experiments are conducted on multiple test sets, including LAVA-M, and the experimental results demonstrate that the proposed DocFuzz methodology surpasses other fuzzing techniques, offering a more precise, rapid, and effective means of detecting vulnerabilities in source code.

Details

Title

DocFuzz: A Directed Fuzzing Method Based on a Feedback Mechanism Mutator

Author

Xie, Lixia¹

; Zhao, Yuheng¹

; Yang, Hongyu²

; Zhao, Ziwen³

; Hu, Ze³

; Zhang, Liang⁴

; Cheng, Xiang⁵

¹ School of Computer Science and Technology Civil Aviation University of China Tianjin 300300 China
² School of Computer Science and Technology Civil Aviation University of China Tianjin 300300 China; School of Safety Science and Engineering Civil Aviation University of China Tianjin 300300 China
³ School of Safety Science and Engineering Civil Aviation University of China Tianjin 300300 China
⁴ School of Information The University of Arizona Tucson 85721 Arizona, USA
⁵ School of Information Engineering Yangzhou University Yangzhou 225127 China; Key Laboratory of Civil Aviation Flight Networking Civil Aviation University of China Tianjin 300300 China

Editor

Yu-an Tan

Publication year

2024

Publication date

2024

Publisher

John Wiley & Sons, Inc.

ISSN

08848173

e-ISSN

1098111X

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/int/7931792

ProQuest document ID

3148030426

DocFuzz: A Directed Fuzzing Method Based on a Feedback Mechanism Mutator

Jump to:

Full text

Abstract

Details

Suggested sources