This dissertation is about using and evaluating large language models (LLMs) for automated reasoning in software bug detection and error specification inference in C. LLMs are generative models trained on extraordinarily large amounts of data, a scale made possible largely by the introduction of the transformer architecture. They are trained on data ranging from natural language text to programming language source code, and on tasks beyond basic generation, e.g., reasoning. Given the range of capabilities these models have shown, there has been an explosion of LLM-based techniques for improving software through applications such as program analysis. This work introduces an interleaved approach that combines static analysis and LLM prompting for error specification inference, and it also examines the state of evaluating LLM-based bug detectors over time.
An error specification in C describes the values a function returns upon error. This practice is known as the return code idiom and exists because C, unlike many other languages, has no built-in exception mechanism. Each program may follow its own conventions for the values returned on error, and it may also use libraries that follow separate conventions. These attributes make automated reasoning about C error specifications challenging. Previous state-of-the-art work leveraged initial domain knowledge, i.e., an initial set of program facts, to bootstrap the program analysis and learn new error specifications. However, if this initial domain knowledge does not propagate to all parts of the analyzed program, some error specifications may be missed. For example, third-party functions come from external sources and are not statically linked; if the static analysis has no background knowledge about such a function, its error specification will be missed. We introduce an interleaved approach that queries an LLM when the static analyzer fails and feeds the newly learned facts back into the analysis. Compared to a baseline static analyzer, our interleaved analyzer improved F1-score and recall while only slightly decreasing precision.
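To illustrate the return code idiom, the minimal sketch below uses a hypothetical function (not drawn from the evaluated benchmarks) whose error specification is {-1}: it returns -1 on error and a non-negative size on success. Other codebases may instead signal errors with NULL, 0, or library-specific constants.

    #include <stdio.h>

    /* Hypothetical function following the return code idiom:
       its error specification is {-1}. */
    long file_size(const char *path) {
        FILE *fp = fopen(path, "rb");
        if (fp == NULL)
            return -1;                 /* error value under this convention */
        if (fseek(fp, 0, SEEK_END) != 0) {
            fclose(fp);
            return -1;
        }
        long size = ftell(fp);
        fclose(fp);
        return size;                   /* success: a non-error value */
    }

    int main(void) {
        long n = file_size("missing.txt");
        if (n == -1)                   /* callers must check the error value */
            fprintf(stderr, "file_size failed\n");
        else
            printf("%ld bytes\n", n);
        return 0;
    }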
LLMs are subject to hallucination, i.e., making incorrect statements that may appear correct on the surface. If imprecision is introduced into the analysis, it may propagate to other portions of the analyzed codebase and significantly increase overall imprecision. LLMs are also non-deterministic, so results may vary between experimental runs. To address these concerns, this dissertation also examines the sources of imprecision in the interleaved analysis. The results show that while imprecision propagating from the LLM appears in some benchmarks, it is not consistent across all of them. We also observe that benchmarks in which the LLM is heavily influential in inferring error specifications show significant variance in results across experimental runs.
With the variety of LLM-based approaches being introduced for bug detection, there is concern that the evaluation of these approaches is flawed. Many of the most popular benchmarks are static or infrequently updated. This is a potential problem for LLMs, as newer models have more recent training knowledge cutoff dates; this is inevitable, since models rely on the context of historical events and need fresh context as time progresses. Given the lack of transparency around LLM training data sources, some of the data used for evaluation may have been seen during training. To address this, we introduced a dataset curation and evaluation pipeline for measuring the performance of LLM-based bug detectors. We demonstrated the pipeline by curating a dataset of 158 null pointer dereference bugs and evaluating five LLM-based bug detector strategies with three popular LLMs. The LLM itself was the most significant factor contributing to successful bug detection, while most bug detector strategies performed similarly. We also compared performance on bugs fixed before and after the LLM cutoff date: although the detection rate was lower for bugs fixed after the cutoff, the difference was not statistically significant.
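For context, the following hypothetical C function (illustrative only, not taken from the curated dataset) contains the kind of null pointer dereference bug the dataset targets: malloc may return NULL, and the missing check allows the null pointer to reach strcpy.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical example of the targeted bug class. */
    char *copy_name(const char *name) {
        char *buf = malloc(strlen(name) + 1);
        /* BUG: buf is not checked for NULL; if malloc fails,
           strcpy dereferences a null pointer. */
        strcpy(buf, name);
        return buf;
    }
    /* A fixed version would add: if (buf == NULL) return NULL; */

    int main(void) {
        char *s = copy_name("example");
        free(s);
        return 0;
    }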