Abstract

Common Weakness Enumerations (CWEs) and Common Vulnerabilities and Exposures (CVEs) are open knowledge bases that provide definitions, descriptions, and samples of code vulnerabilities. The combination of Large Language Models (LLMs) with vulnerability knowledge bases helps to enhance and automate code vulnerability repair. Several key factors come into play in this setting, including (1) the retrieval of the most relevant context to a specific vulnerable code snippet; (2) augmenting LLM prompts with the retrieved context; and (3) the generated artifact form, such as a code repair with natural language explanations or a code repair only. Artifacts produced by these factors often lack transparency and explainability regarding the rationale behind the repair. In this paper, we propose an LLM-enabled framework for explainable recommendation of vulnerable code repairs with techniques addressing each factor. Our method is data-driven, which means the data characteristics of the selected CWE and CVE datasets and the knowledge base determine the best retrieval strategies. Across 100 experiments, we observe the inadequacy of the SOTA metrics to differentiate between low-quality and irrelevant repairs. To address this limitation, we design the LLM-as-a-Judge framework to enhance the robustness of recommendation assessments. Compared to baselines from prior works, as well as using static code analysis and LLMs in zero-shot, our findings highlight that multifaceted LLMs guided by retrieval context produce explainable and reliable recommendations under a small to mild level of self-alignment bias. Our work is developed on open-source knowledge bases and models, which makes it reproducible and extensible to new datasets and retrieval strategies.

1. Introduction

A vulnerability is a flaw in software logic or hardware components that can be exploited to compromise the integrity and confidentiality of a system [1]. The number of software and open-source projects is expanding rapidly, with about 518 million projects and a 25% year-on-year growth [2], which has led to an increase in software vulnerabilities [3]. Different methods have been proposed for the repair of software vulnerabilities. Traditional approaches to detecting and repairing vulnerabilities, such as program analysis-based tools and rule-based vulnerability detectors, suffer from high false-positive rates [4] and struggle to generalize to diverse vulnerabilities [5]. Various deep learning-based approaches have also been introduced, leading to improvements in detecting and repairing software vulnerabilities [6,7,8]. LLMs have proven to be effective across various natural language and software engineering tasks, including vulnerability repair [5,9], due to the large corpus of data they are pre-trained on [10].

Despite the growing body of work in the field of software vulnerability, some challenges still exist, such as handling diverse vulnerability data and the context window limitation of LLMs [11]. Works such as that of Zhou et al. [5] have outperformed prior SOTA results [8], with CodeBLEU [12] scores improving from 32.5% to 40.9% and Exact Match scores improving from 16.8% to 20.0%.

While existing techniques focus on generating code repairs, they lack sufficient rationale and remain opaque about why specific code segments were repaired, which is useful in guiding downstream processes such as LLM-powered code generation. Knowledge bases such as CWE and CVE repositories offer detailed contexts, rationales, and examples of CWEs and CVEs that can be leveraged to generate recommendations tailored to the vulnerability. These recommendations, when integrated with code generation processes, provide structured guidance to LLMs, improving the relevance and quality of generated repairs. This idea aligns with the broader approach of retrieval-augmented generation (RAG), which combines external knowledge retrieval with generative models to improve output quality [13].

We acknowledge that the retrieval strategies influence the effectiveness of the generated recommendations. To identify the most suitable retrieval strategy for our use case, we explore and evaluate four retrieval strategy alternatives, namely vanilla, metadata embedding, segmented context, and metadata-driven retrieval. These four retrieval strategies are combined with four LLMs, including GPT-4o [14], Llama3-8B Instruct [15], Mixtral 8 × 7B Instruct [16], and CodeLlama 34B [17], to generate recommendations. We compare the quality of the LLM code repair recommendations using zero-shot inferences against the generated quality using each of the four retrieval strategies, by generating code fixes from the recommendations and then evaluating them with CodeBLEU, chrF, and RUBY to quantify performance improvements. We apply a total of 294 CWE types and 10,000 CVE samples from two datasets, BigVul [18] and CVEFixes [19]. In total, about 100 experiment settings are applied to assess retrieval accuracy and recommendation quality. Our observation highlights that the metadata-driven retrieval is the best-performing retrieval strategy for our use case. We also observe that the reference-based metrics assessments are inadequate for differentiating irrelevant repairs that may share surface-level similarities of code syntax with the ground truth. We also use static code analysis to run security tests to measure the ability of the LLMs to repair vulnerable code by comparing the zero-shot performance to the metadata-driven retrieval strategy, using the best-performing model, GPT-4o.

To address the challenges associated with reference-based metrics assessments, we propose using the LLM-as-a-Judge assessment framework to evaluate recommendations according to defined criteria of Relevance, Completeness, Correctness, Identification of Vulnerable Code, and Code Guidance. We compare zero-shot and retrieval strategy performance scores to quantify the added value of each strategy. By scoring recommendations based on these criteria, the LLM judge provides a comprehensive evaluation that aligns more closely with human judgment. We conduct eight experiments with two models to identify the bias tendency of LLMs to prefer their own recommendations. We aim to answer three research questions (RQs).

RQ1: What are the factors of embedding and retrieval that affect the recommendations of code-repairing strategies using LLMs? We consider data characteristics to be the key to influencing the embedding and retrieval quality. We develop and evaluate retrieval strategies to present variations in handling data characteristics. These strategies become the base for quantitatively measuring the retrieval accuracy of embedding strategies.

RQ2: Does the proposed code vulnerability recommendation pipeline operate in a language-agnostic manner? We investigate whether the recommendations generated over different programming languages show consistent quality of repair recommendations without the over- or under-performance of any particular language.

RQ3: How do different evaluation methods affect the assessment of code-repairing recommendations? We investigate the limitations of using reference-based metrics such as CodeBLEU, chrF, and RUBY to assess code repairs based on the LLM recommendations. We also investigate the ability of an LLM-as-a-Judge to directly evaluate the recommendations to overcome the limitations of the SOTA metrics and provide alignment with the human manual inspection process.

Our contributions in this paper are threefold:

Code Retrieval Strategies: We explore and evaluate a set of retrieval strategies, including vanilla, metadata embedding, segmented context, and metadata-driven retrieval, that capture the data characteristics of the knowledge bases, such as token overlap, metadata uniqueness, syntactical diversity, and semantic similarity. These strategies are evaluated to assess their impact on the quality of recommendations and to determine which strategy is best suited for our use case.

LLM-based Pipeline for Vulnerability Repair Recommendations: We propose a retrieval and augmentation pipeline that retrieves domain-specific contextual information from an open knowledge base, augments the retrieved context into LLM prompts, and generates actionable recommendations.

LLM-based Recommendation Assessment: We utilize an LLM-as-a-Judge framework to evaluate recommendations based on criteria aligned with the manual inspection process. Our evaluation spans comprehensive scenarios, multiple LLM models, and a diverse set of metrics, providing a robust assessment of recommendation quality.

This paper is structured as follows: We introduce the background and related code repair works in Section 2. Section 3 discusses the retrieval and augmentation strategies we utilize for context retrieval and our code repair recommendation pipeline. In Section 4, we discuss our experimental setup, including the LLM selection and the establishment of our baselines. In Section 5, we analyze the limitations of using reference-based metrics assessments to evaluate code repair recommendations and discuss the results of the static code analysis. We present the LLM-based assessment in Section 6. Finally, we discuss the guidelines for reproducible adoption in Section 7, highlight the potential threats to validity in Section 8, and draw conclusions in Section 9.

2. Background and Related Works

CVEs are unique identifiers given to publicly known vulnerabilities in software that can be leveraged to compromise the security of an organization’s system. A CVE entry consists of a unique identifier number, a vulnerability description, and related references. There are currently about 291,000 reported vulnerabilities as of August 2025 [20]. CWEs, on the other hand, serve as a classification system for organizing CVEs into general categories [21].

Research in software vulnerability analysis has evolved from rule-based and pattern-based approaches to various deep learning methods and the use of LLMs, improving automated code vulnerability repair. However, challenges still remain in the field. This section examines related works by grouping them into key challenges, namely Imbalanced Datasets, Evaluation Limitations, and LLM Context Constraints, highlighting gaps and how our work addresses them.

Imbalanced Datasets. Studies such as those by Chen et al. [22] and Pearce et al. [23] highlight the challenges posed by small and imbalanced datasets in vulnerability repair, which limit deep learning models’ ability to generalize to less frequent vulnerabilities. Russell et al. [24] also highlight the limitations of existing vulnerability datasets, such as SATE IV, which rely heavily on synthetic examples that fail to capture the diversity of real-world vulnerabilities. Wu et al. [25] address the challenge of underrepresented CWE types by introducing a benchmark designed to increase coverage of less common CWEs. Despite this effort, the study demonstrates that existing LLMs and Automated Program Repair (APR) models struggle to repair many underrepresented CWE types. Zhou et al. [5] attempt to fix the issue of generalization by incorporating parent, child, and sibling CWE types alongside the target CWE type, making the model more robust to underrepresented CWEs. The issue of generalizability caused by training on limited datasets still persists. Many approaches rely on small or synthetic datasets, often leading to overfitting to specific benchmarks. Although some works introduce variations or augment existing datasets, these efforts fall short of representing the diversity and complexity of real-world vulnerabilities. Given the limited availability of comprehensive datasets, training a model to handle all CWE and CVE cases is unrealistic.

Instead, our approach employs retrieval and augmentation strategies to retrieve relevant CWE and CVE examples and knowledge dynamically, which are augmented into the LLM prompts to enable them to provide actionable recommendations. Because this approach relies on retrieval rather than model fine-tuning, it minimizes the risk of overfitting to specific vulnerability types. The LLMs are not limited by training data but instead have access to an external knowledge base that can be updated with new vulnerability data without requiring model retraining. This approach supports diversity by allowing access to examples that span multiple CWE and CVE categories.

LLM Context Constraints. In their work, Mashhadi et al. [11] highlight how CodeBERT’s architecture is constrained by a maximum sequence length of 512 tokens, limiting its ability to process longer programs. Islam et al. [26] propose an architecture to address the inefficiency of traditional encoder-decoder models in processing long code sequences for vulnerability repair, mitigating context window limitations to some extent. Joshi et al. [27] highlight Codex’s degraded performance when processing extended code snippets due to context window limitations. Zhou et al. [5] introduce VulMaster, which tackles the context window limitations of Transformer-based models such as CodeT5 by using a divide-and-conquer approach to process lengthy code segments without truncation. Pearce et al. [23] identify context window limitations as a significant constraint, with token limits necessitating code truncation or manual selection of key segments. Berabi et al. [9] tackle context window limitations by using a technique that directs attention to relevant code snippets. While some of these approaches improve context handling to some extent, the issue persists across various domains, hence the many attempts to continuously increase LLM context window sizes.

To address these challenges in the code vulnerability recommendation context, our approach employs an LLM agent that dynamically chunks large code snippets and processes each chunk in parallel. By integrating relevant external context during analysis, the LLM agent can identify vulnerable code snippets and generate actionable recommendations for mitigating vulnerabilities. This strategy ensures comprehensive coverage of vulnerabilities for large code snippets.

Evaluation Limitations. De-Fitero-Dominguez et al. [28] critique reference-based metrics such as BLEU, arguing that these focus on textual similarity rather than functional correctness, calling for more nuanced evaluation approaches. Islam et al. [26] also critique reference-based evaluation metrics, emphasizing their inadequacy in fully capturing the quality of generated code for vulnerability repair. A similar concern is raised by Evtikhiev et al. [29] about using these metrics to evaluate code generation models; they also highlight how these metrics fail to align with human judgment. Fu et al. [30] evaluate ChatGPT for vulnerability tasks such as code repair and critique reference-based evaluation metrics, noting their inadequacy in measuring the correctness of repair patches. However, these metrics remain widely used despite their limitations in evaluating the relevance, completeness, and contextual applicability of LLM-generated text. Some studies, such as Ahmed et al. [31], use human evaluation, which remains the gold standard for assessing the quality of LLM-generated text. However, it comes with significant economic and scalability challenges. LLMs can perform evaluations at a fraction of the cost compared to human experts and do not suffer from challenges such as inconsistency due to fatigue, subjective biases, and prolonged evaluation periods caused by large volumes of content [32].

To overcome these challenges, recent research has adopted automated evaluation approaches that leverage LLMs as evaluators, which offer superior consistency by applying evaluation criteria uniformly across large datasets [33,34,35]. Furthermore, studies on frameworks such as G-Eval [36] have shown that integrating chain-of-thought reasoning into LLM evaluations can yield results that closely align with human judgment. This evidence supports the idea that well-crafted prompts can enable general-purpose large foundation models to serve as effective judges without the need for additional fine-tuning. Recent research shows that fine-tuning approaches, for example, using techniques such as knowledge distillation [37], can improve the performance of smaller or open-source models, allowing them to approach the evaluation quality of larger models such as GPT-4.

However, in our case, we rely on a model from the GPT-4 family, GPT-4o, which has already demonstrated good performance and a high correlation with human judgment [36]. Given this strong inherent performance, we decide not to fine-tune GPT-4o and instead provide all the necessary context and evaluation rubrics in structured prompts for the LLM to follow. However, the LLM-as-a-Judge approach is not without challenges. Self-alignment bias, for instance, can result in a preference for responses generated by the same LLM [38].

For a comprehensive assessment of code repair recommendations, we leverage reference-based metrics assessments, static code analysis, and an LLM-as-a-Judge framework. The LLM-as-a-Judge approach scores recommendations based on five defined criteria, including relevance, completeness, and correctness. This hybrid evaluation strategy provides deeper insights into the usefulness of recommendations. We also study potential biases in LLM evaluations, offering a more comprehensive assessment of code vulnerability repair.

3. Retrieval and Augmentation Strategy for Code Repair Recommendations

The system is designed to take the vulnerable code, its associated CWE and CVE information, and the retrieved contextual information as input and to generate a corresponding recommendation for repairing the vulnerability. The methodology is divided into three stages that address the key challenges of code vulnerability repair, as presented in Figure 1. Stage One covers the retrieval and augmentation strategies, which produce embeddings for accurate and relevant retrieval. Stage Two covers the pipelines for recommendation generation, which produce recommendations and code repairs; here we measure performance using the reference-based metrics and perform static code analysis to measure the code repair rate. Stage Three covers the LLM-based assessment, which evaluates the recommendations with an LLM against defined assessment criteria, providing a comprehensive evaluation that addresses the limitations of the reference-based metrics.

3.1. Data Characteristic Analysis

The data characteristics of the vulnerability datasets drive the design of the retrieval strategies. We classify these characteristics into four representative categories, namely token overlaps due to shared CWEs, metadata uniqueness, syntactical diversity, and semantic similarity. For each characteristic, we outline its potential impact on retrieval quality and describe how the four retrieval strategies explore different ways of addressing it.

Token overlaps. Due to shared CWEs, entries in the BigVul and CVEFixes datasets share common tokens, which may lead to embeddings with closely matching values and thus reduce retrieval precision. For example, CVEs belonging to the same CWE share characteristics such as the CWE name, description, extended description, potential mitigation, and notes. By calculating the cosine similarity between the vector embeddings, we quantitatively assess the degree of token overlap and embedding similarity. Figure 2 shows the embedding similarity of ten CVEs associated with CWE-189. As discussed in [39], while token overlap improves metrics such as perplexity and F1 scores, models may overpredict semantic similarity for text pairs with high surface-level token overlap but different meanings. The vanilla strategy embeds all fields together, making it the most exposed to this repetition, while the metadata embedding strategy separates fields but may still retain high-frequency tokens within individual fields. The segmented context strategy reduces the amount of repeated content by embedding distinctive fields such as CWE/CVE identifiers, and the metadata-driven retrieval strategy does the same but also applies filtering before ranking, which can narrow the results and indirectly limit the impact of token overlap.
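
As a minimal illustration of this similarity analysis, the sketch below computes pairwise cosine similarities between the concatenated context strings of CVEs that share CWE boilerplate. It assumes the sentence-transformers package and an arbitrary embedding model ("all-MiniLM-L6-v2"); the paper's actual embedding model may differ. High off-diagonal values indicate the token-overlap effect described above.

```python
# Sketch: quantify embedding similarity caused by token overlap (illustrative model choice).
from sentence_transformers import SentenceTransformer

def pairwise_cosine_similarity(texts, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    # Normalized embeddings make cosine similarity a simple dot product.
    emb = model.encode(texts, normalize_embeddings=True)
    return emb @ emb.T

# Hypothetical CVE contexts sharing CWE-189 boilerplate (name, description, mitigation).
cve_contexts = [
    "CWE-189 Numeric Errors. Description: ... Potential mitigations: ... CVE-2010-1234: integer overflow in foo()",
    "CWE-189 Numeric Errors. Description: ... Potential mitigations: ... CVE-2011-5678: off-by-one error in bar()",
]
print(pairwise_cosine_similarity(cve_contexts))  # off-diagonal values near 1.0 signal heavy token overlap
```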

Metadata uniqueness. The CVEFixes and BigVul datasets contain structured attributes, including CVE IDs, CWE identifiers, programming language labels, and textual descriptions of the CVEs. Each CVE ID represents a specific software vulnerability (such as CVE-2009-1194), while a CWE ID categorizes the underlying weakness type (such as CWE-89 for SQL injection). This creates unique metadata pairs. For example, CVE-2009-1194 is linked to CWE-89. Recent findings [40,41] indicate improved retrieval precision when metadata is used to filter and refine search results. The vanilla strategy does not explicitly leverage these fields beyond including them in the text, while the metadata embedding strategy encodes each as a separate vector to explore whether direct representation improves retrieval. The segmented context strategy uses only the most distinctive identifiers in the embedding step and keeps other details as metadata, and the metadata-driven retrieval strategy takes this further by applying explicit metadata-based filters before semantic ranking.

Syntactical diversity. We include the CVEFixes dataset that contains about 30 distinct programming languages and the BigVul dataset, which contains only C/C++ code samples. Prior works [22,28], however, restrict CVEFixes to its C/C++ subset when evaluating vulnerability repair approaches. Figure 3 presents the distribution of programming languages within the CVEFixes dataset, highlighting the syntactical diversity that retrieval methods must accommodate. All four retrieval strategies are applied without language-specific adjustments. They rely on generic embeddings or metadata filtering that treat all code as sequences of tokens. The vanilla strategy and metadata embedding strategy both preserve the full language-specific syntax in embeddings, segmented context reduces embedded language-specific content by focusing on identifiers, and metadata-driven retrieval constrains results based on metadata while still allowing retrieval across different programming languages.

Semantic similarity. Vulnerabilities within the same CWE class share the same underlying security weakness but may be realized through different syntactic and structural implementations. For example, CWE-79 (Cross-Site Scripting) can occur in JavaScript, Python, or PHP code, each with different function calls, escaping mechanisms, and context-aware mitigation strategies. Despite these differences in language implementation, the vulnerabilities are semantically similar. The vanilla strategy, by embedding the full context, has the broadest opportunity to capture such similarities if the embedding model recognizes them. The metadata embedding strategy might capture conceptual links if identifiers and descriptive fields preserve meaning, while segmented context narrows embeddings to minimal identifiers but retains full descriptions and code in metadata for later use. The metadata-driven retrieval first restricts the pool to candidates with matching identifiers, increasing the likelihood that the remaining examples share the same conceptual vulnerability before applying semantic ranking.

3.2. Retrieval and Augmentation Strategies

The retrieval strategies in our work are designed to explore whether varying the balance between embedded content and metadata filtering can influence retrieval precision in vulnerability repair tasks. Each retrieval strategy varies in the way context fields from CVE and CWE data are represented. Some embed all fields together, others separate or minimize what is embedded, and some use metadata filtering before ranking. This variation is motivated by the characteristics of the datasets used, as discussed in the previous section. These four retrieval strategies were selected to reflect different ways of combining embedded content and metadata filtering based on dataset characteristics. They are meant to show how retrieval design choices affect repair quality, rather than to cover every possible method. While Section 3.1 details the specific mapping between these characteristics and each retrieval strategy, in this section, we illustrate the operational differences and design rationale, supported by workflows for each approach.

3.2.1. Vanilla Strategy

The vanilla strategy serves as the baseline for comparison. In this strategy, all fields from the CWE and CVE descriptions, along with code examples from the dataset, are concatenated into a single context string. This string is embedded as both dense and sparse vectors, and the embedding object is stored in the vector database. At query time, the CVE and CWE identifiers of the target vulnerability are included in the query string, which is embedded and used to retrieve the top-k most similar items from the database. This workflow is further illustrated in Figure 4.

This strategy is simple to implement and maximizes recall by ensuring that no potentially relevant information is excluded from the embeddings. However, it is also sensitive to token overlaps, particularly when CVEs sharing the same CWE share large portions of text, such as names, descriptions, extended descriptions, and potential mitigations. Such overlaps can lead to highly similar embeddings, potentially reducing retrieval precision.
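
A minimal sketch of the vanilla strategy is shown below. The embedding functions and the vector-database client are hypothetical placeholders; any dense/sparse embedding models and vector store could fill these roles.

```python
# Sketch of the vanilla strategy: embed the full concatenated context (hypothetical helpers).

def build_vanilla_entry(cwe: dict, cve: dict, code_example: str) -> dict:
    # Concatenate every CWE/CVE field plus the code example into one context string.
    context = "\n".join([
        f"CWE: {cwe['id']} {cwe['name']}",
        f"CWE description: {cwe['description']}",
        f"CWE extended description: {cwe['extended_description']}",
        f"Potential mitigations: {cwe['mitigations']}",
        f"CVE: {cve['id']}",
        f"CVE description: {cve['description']}",
        f"Code example:\n{code_example}",
    ])
    return {
        "dense": embed_dense(context),    # hypothetical dense embedding call
        "sparse": embed_sparse(context),  # hypothetical sparse (lexical) embedding call
        "payload": {"cwe_id": cwe["id"], "cve_id": cve["id"], "context": context},
    }

def query_vanilla(vector_db, cwe_id: str, cve_id: str, k: int = 5):
    # The query string carries the target identifiers; top-k hybrid similarity search.
    query = f"CWE: {cwe_id} CVE: {cve_id}"
    return vector_db.search(dense=embed_dense(query), sparse=embed_sparse(query), top_k=k)
```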

3.2.2. Metadata Embedding Strategy

The metadata embedding strategy restructures the input by representing each relevant CWE and CVE field as a separate key–value pair. Instead of concatenating all fields into a single text, we embed each key–value pair separately. The resulting vectors, along with their associated field metadata, are stored in the vector database. This approach attempts to capture distinctions between fields that may be obscured when merged into a single context string. At query time, the CVE and CWE identifiers of the target vulnerability are included in the query string, embedded and then matched against the stored vectors. The workflow for the metadata embedding is further illustrated in Figure 5.

Because each field is embedded separately, fields with similar wording across different vulnerabilities may still produce closely aligned embeddings, potentially reducing retrieval precision.
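
The sketch below illustrates how the metadata embedding strategy differs from the vanilla strategy: each key-value pair is embedded on its own, with the field name kept in the payload. The embedding call and payload schema are hypothetical, as above.

```python
# Sketch of the metadata embedding strategy: one vector per key-value pair (hypothetical helpers).

def build_metadata_embedding_entries(cwe: dict, cve: dict, code_example: str) -> list[dict]:
    fields = {
        "cwe_id": cwe["id"],
        "cwe_name": cwe["name"],
        "cwe_description": cwe["description"],
        "cve_id": cve["id"],
        "cve_description": cve["description"],
        "code_example": code_example,
    }
    entries = []
    for key, value in fields.items():
        text = f"{key}: {value}"
        entries.append({
            "dense": embed_dense(text),  # hypothetical embedding call
            "payload": {"field": key, "value": value, "cwe_id": cwe["id"], "cve_id": cve["id"]},
        })
    return entries  # matched field-by-field against the embedded query at retrieval time
```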

3.2.3. Segmented Context Strategy

The segmented context strategy reduces the amount of embedded data by selecting only the most essential fields, such as the CWE and CVE identifiers, that are directly relevant to identifying the vulnerability. These selected fields are concatenated into a single context string, while the remaining metadata, such as descriptions and code snippets, are stored separately and not embedded. This approach explores whether limiting the embedded content can reduce excessive token overlaps without discarding relevant data that may still be useful for filtering. The process for the segmented context strategy is further illustrated in Figure 6.

For this strategy, the reduced context string is embedded as dense and sparse vectors and stored in the vector database together with the unembedded metadata. At query time, the CWE and CVE IDs are embedded and used to retrieve the top-k most similar entries. However, limiting the embedded content may also omit relevant data that could otherwise contribute to semantic matching.

3.2.4. Metadata-Driven Retrieval Strategy

The metadata-driven retrieval strategy builds on the segmented context approach but adds a filtering stage during query execution. As with segmented context, the embedded content is limited to a reduced set of key properties, such as the CWE and CVE identifiers, to help minimize token overlap in the resulting embeddings. All other data is stored as metadata alongside the embeddings.

During retrieval, the system first queries the vector database using the embedded CWE and CVE IDs. Then, it applies metadata filters to narrow the results before ranking them by embedding similarity as illustrated in Figure 7. This strategy explores whether combining similarity matching with metadata filtering can improve retrieval precision on the vulnerability data. However, overly strict filters may exclude relevant but non-identical matches.
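
A minimal sketch of the metadata-driven retrieval strategy follows. It also illustrates the reduced embedded content shared with the segmented context strategy; the embedding call, the `metadata_filter` argument, and the payload schema are hypothetical and stand in for whatever vector database is used.

```python
# Sketch of the metadata-driven retrieval strategy: minimal embedded content plus metadata filtering.

def build_mdr_entry(cwe: dict, cve: dict, code_example: str) -> dict:
    # Only the distinctive identifiers are embedded; everything else stays as unembedded metadata.
    key_text = f"CWE: {cwe['id']} CVE: {cve['id']}"
    return {
        "dense": embed_dense(key_text),  # hypothetical embedding call
        "payload": {
            "cwe_id": cwe["id"],
            "cve_id": cve["id"],
            "language": cve.get("language"),
            "description": cve["description"],
            "code_example": code_example,
        },
    }

def query_mdr(vector_db, cwe_id: str, cve_id: str, language: str | None = None, k: int = 5):
    # Filter candidates by exact metadata first, then rank the remainder by embedding similarity.
    filters = {"cwe_id": cwe_id}
    if language:
        filters["language"] = language
    return vector_db.search(dense=embed_dense(f"CWE: {cwe_id} CVE: {cve_id}"),
                            metadata_filter=filters,  # hypothetical filter argument
                            top_k=k)
```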

3.3. The Code Repair Recommendation Pipeline

The recommendation pipeline encompasses relevant context retrieval, code preprocessing, and recommendation generation.

3.3.1. Relevant Context Retrieval

A query containing the vulnerable code invokes the pipeline to retrieve relevant contextual information from the knowledge base. This includes CWE and CVE descriptions, examples of vulnerable and fixed code, metadata, and any additional details stored using the defined retrieval strategies. For each dataset in the knowledge base, we measure the retrieval accuracy of each strategy. The results are presented in Figure 8. All the retrieval strategies perform well on the CWE Mitre knowledge base [21], which does not have significant token overlaps because it contains distinct information and descriptions for each CWE.

In contrast, CVE samples of the same CWE types in BigVul and CVEFixes share descriptions, extended descriptions, potential mitigations, and notes. This causes the issue of shared tokens across different samples, leading to retrieval inaccuracies. To solve this problem, the metadata-driven retrieval and segmented context strategies apply embeddings of the unique fields only and filter retrieval results using the metadata (metadata-driven retrieval only). Both strategies demonstrate high retrieval accuracy, but overall, the metadata-driven retrieval strategy emerged as the best-performing strategy.

Summary: The retrieval and augmentation strategies are designed based on specific data characteristics found in the selected code vulnerability datasets, BigVul and CVEFixes. The metadata-driven retrieval strategy emerged as the best-performing strategy, demonstrating superior retrieval performance by effectively capturing semantic relationships while minimizing token overlap issues. We design a code repair recommendation pipeline that is both programming language-agnostic and model-agnostic, making it extensible for various programming languages and LLMs. To quantitatively validate these retrieval strategies, a baseline must be established with which to compare the code repair recommendation performance of each retrieval strategy.

3.3.2. Code Pre-Processing and Chunking

Figure 9 shows the token length distribution within our datasets, categorized into token counts below 8K and exceeding thresholds of 8K, 16K, 32K, and 128K tokens. These thresholds correspond to the token processing limits of our selected models. In the CVEFixes dataset, 22.10% of instances exceed 8K tokens (the token limit of Llama-3 8B Instruct), 11.83% exceed 16K tokens (the token limit of CodeLlama), 5.52% exceed 32K tokens (the token limit of Mixtral 8 × 7B Instruct), and 0.84% exceed 128K tokens (the token limit of GPT-4o). In contrast, the BigVul dataset has token lengths mainly below 8K, with only a negligible portion of instances, approximately 0.12%, above 8K and 0.03% above 16K tokens.

For code snippets that exceed the LLM’s context window, we use an LLM agent alongside a recursive text splitter from Langchain [42] to split the code into smaller chunks. Our design is illustrated in Figure 10. An LLM agent has complex reasoning capabilities that enable it to break down complex problems and execute tasks in logical steps with the aid of tools [43]. Each chunk is processed in parallel, allowing the agent to handle large snippets without exceeding the model’s context limit. Each parallel process uses an LLM-powered tool that is prompted to find vulnerabilities based on the context retrieved from the vector database and outputs any snippet of vulnerable code. Each tool outputs either a snippet of vulnerable code or a “Not vulnerable” response. Finally, the detected vulnerable snippets are aggregated and combined with the retrieved context and fed back to the LLM to provide an actionable recommendation for repairing the vulnerable snippets.
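
The sketch below shows one way this chunk-and-analyze step could look. It assumes LangChain's RecursiveCharacterTextSplitter (the import path may vary across LangChain versions) and a hypothetical LLM-powered detection tool, `detect_vulnerable_snippet`, prompted with the retrieved CWE/CVE context; chunk sizes are in characters rather than tokens.

```python
# Sketch of chunking and parallel vulnerability detection for long code snippets.
from concurrent.futures import ThreadPoolExecutor
from langchain.text_splitter import RecursiveCharacterTextSplitter

def find_vulnerable_snippets(code: str, retrieved_context: str, chunk_size: int = 6000) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=200)
    chunks = splitter.split_text(code)

    def analyze(chunk: str) -> str:
        # detect_vulnerable_snippet is a hypothetical LLM tool; it returns a vulnerable
        # snippet or the string "Not vulnerable".
        return detect_vulnerable_snippet(chunk, retrieved_context)

    with ThreadPoolExecutor() as pool:  # process chunks in parallel
        results = list(pool.map(analyze, chunks))
    # Aggregate detected snippets for the final recommendation step.
    return [r for r in results if r != "Not vulnerable"]
```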

3.3.3. Recommendation Generation

We augment the LLM prompt with the retrieved CWE/CVE information, code examples, and the vulnerable code. We use few-shot prompting with four LLMs, including GPT-4o, Llama-3 8B Instruct, Mixtral 8 × 7B Instruct, and CodeLlama 34B, to recommend mitigation steps to repair the vulnerable code. Each recommendation comprises an issue description, the repair recommendation, and the code fix.
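
A simplified sketch of the prompt augmentation is shown below. The template wording and the `llm` callable are hypothetical; the actual prompts and model interfaces used in the experiments may differ.

```python
# Sketch of prompt augmentation for recommendation generation (hypothetical template and llm call).
RECOMMENDATION_PROMPT = """You are a security expert. Using the retrieved vulnerability context,
recommend how to repair the vulnerable code.

Retrieved context (CWE/CVE details and example fixes):
{context}

Few-shot examples of vulnerable code and their repairs:
{examples}

Vulnerable code:
{code}

Respond with: (1) an issue description, (2) a repair recommendation, and (3) the code fix."""

def generate_recommendation(llm, retrieved_context: str, few_shot_examples: str, vulnerable_code: str) -> str:
    prompt = RECOMMENDATION_PROMPT.format(context=retrieved_context,
                                          examples=few_shot_examples,
                                          code=vulnerable_code)
    return llm(prompt)  # any of the four selected models behind a common interface
```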

4. Experimental Setup

This section discusses the experimental design of our study. We first establish the baselines using zero-shot variants of our selected LLMs and then evaluate them to measure their performance. The setup covers model selection, datasets, and metrics, as well as the use of static code analysis tools and design considerations.

4.1. Baseline Establishment

The baseline is essential to evaluate the recommendation of vulnerable code repair. We develop the baseline along the following aspects: (1) selection of LLMs; (2) pipelines using static code analysis and zero-shot learning with an LLM; and (3) metrics for evaluating model performance, as illustrated in Figure 11.

Segment one of Figure 11 highlights the static code analysis process. First, we use a static code analyzer to enumerate vulnerabilities identified within the test datasets. The main purpose of this static analysis is to measure how effectively the generated fixes resolve the identified vulnerabilities. The results are compared with recommendations generated from the zero-shot learning pipeline and pipelines augmented by retrieval-based strategies.

Within segment two of Figure 11, we input the vulnerable code snippets into an LLM operating in a zero-shot setting, meaning the LLM receives no additional context or guidance beyond the snippet itself. The LLM then generates recommendations intended to address the identified vulnerabilities. We subsequently generate code fixes based solely on these zero-shot recommendations and evaluate their practical effectiveness through static analysis. This provides an initial performance baseline by measuring how effectively these fixes resolve the identified vulnerabilities. We evaluate the syntactic and semantic quality of the generated repairs using standard metrics, including CodeBLEU, chrF, and RUBY. These evaluations establish the baseline metric scores reflecting the LLM’s default capability to generate repairs similar to the ground truth fixes without enhanced context retrieval.

Within segment three of Figure 11, we utilize the LLM-as-a-Judge approach to assess the recommendations of the zero-shot LLM based on criteria specifically defined for vulnerability repair: relevance, correctness, completeness, identification of vulnerable code, and code guidance. These scores reflect quality dimensions aligned with human evaluation processes. Together, these steps establish the multi-dimensional baselines crucial for the subsequent comparative analyses of the retrieval strategies.

4.2. The Each-Choice-Coverage Strategy for Selecting LLMs

In our framework, we apply an each-choice-coverage strategy to select a diverse set of models representing a spectrum of available LLMs. Rather than using multiple models with similar characteristics, we deliberately choose one representative model from each major architectural category, including High-Performance Proprietary models, Instruction-Tuned Open-Source models, Mixture-of-Experts (MoE) models, and Code-Specialized models.

4.2.1. High-Performance Proprietary Models

Gemini [44], Claude 3 [45], and GPT-4o are in this category. We choose GPT-4o, which has a context window of 128K tokens. GPT-4o has outperformed code-specialized models such as CodeLlama and DeepSeek-Coder on coding benchmarks [46] and is well suited for handling complex code repair tasks [14].

4.2.2. Instruction-Tuned Open-Source Models

Models in this category include Alpaca, Vicuna [47], and Llama3 8B Instruct. We select Llama3 8B Instruct, an open-source model with a context window of 8K, trained on multilingual text and code, fine-tuned with human feedback, and optimized for instruction-following [15]. Instruction-following is important in our framework since the models are given prompts with enriched context to instruct the LLMs in code repair and recommendation generation.

4.2.3. Mixture-of-Experts Models

Models in this category include Grok [48], GLaM from Google [49], and Mixtral 8 × 7B Instruct. Mixture-of-Experts (MoE) models offer a trade-off between capability and speed. The MoE architecture specifically enables efficient handling of diverse tasks, improving model responsiveness and recommendation accuracy [50]. We select the Mixtral 8 × 7B Instruct model, which outperforms Llama 2 70B on math and coding benchmarks. The fine-tuned instruction version outperforms GPT-3.5 Turbo, Claude 2.1, and Meta’s 70B chat model on human evaluation benchmarks [16]. Its 32K token context is valuable for reviewing the retrieved context and large code snippets.

4.2.4. Code-Specialized Models

This category covers code-specialized models that excel in code synthesis and infilling. Models in this category include CodeT5 [51], DeepSeek-Coder [46], and CodeLlama 34B, a state-of-the-art code-specialized model with a 16K context length [17]. We select CodeLlama 34B because it offers expertise in programming languages and debugging; DeepSeek-Coder’s capabilities largely overlap with CodeLlama’s.

4.2.5. Extension to Other Models

The each-choice-coverage strategy can be extended to other LLMs in a programmable way. Given the model-agnostic nature of our framework, the LLM utilized in Stage 2 of Figure 1 can be replaced by alternative models. Tools such as Ollama provide the plugins and adapters to facilitate the run-time switching of LLMs. Such a deployment requires an inference server platform that is capable of container orchestration, fast servers for inferencing, model management, and monitoring. The scale of the computing capacity demanded is beyond the scope of this paper.

4.3. Reference-Based Metrics Evaluation of the Repaired Code

We conduct experiments using four LLMs, including GPT-4o, Llama-3 8B Instruct, Mixtral 8 × 7B Instruct, and CodeLlama-34B, on two datasets, CVEFixes and BigVul, using CodeBLEU, chrF, and RUBY for assessment. Though the CWE Mitre website data is included in the knowledge base, it is excluded from the datasets because it does not contain sample vulnerable code and its fixes, only various CWE details.

For each dataset, we split the data into two: a retrieval corpus we embed and use to populate the vector database, and an evaluation set used at inference time. We group all vulnerability samples by their CVE identifier, remove exact duplicates using dataset-specific identifiers (file hash for CVEFixes and commit ID for BigVul), and then create the split at the instance level. This approach ensures that while the same CVE vulnerability type may appear in both splits, no identical code snippet appears in both sets, and test instances represent genuinely different implementation contexts across various programming languages or distinct code patterns. While our CVE-based split and duplicate removal avoid exact instance overlap, we acknowledge that direct pretraining contamination and indirect synthetic-data leakage remain possible as described by Matton et al. [52]. The retrieval corpus is converted to embeddings and indexed in a vector database as depicted in Stage One of Figure 1, while the evaluation set provides ground truth for evaluation in Stage Two of the recommendation pipeline.
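
The sketch below gives one plausible realization of this split using pandas. The column names (`cve_id`, plus `file_hash` for CVEFixes or `commit_id` for BigVul) and the split fraction are assumptions for illustration, not the datasets' exact schemas.

```python
# Sketch of the retrieval/evaluation split with CVE-level grouping and duplicate removal.
import pandas as pd

def split_corpus(df: pd.DataFrame, dedup_col: str, test_fraction: float = 0.2, seed: int = 42):
    # dedup_col would be "file_hash" for CVEFixes and "commit_id" for BigVul in this sketch.
    df = df.drop_duplicates(subset=dedup_col)
    # Instance-level split within each CVE group: the same CVE may appear in both splits,
    # but no identical code snippet does.
    eval_set = df.groupby("cve_id", group_keys=False).sample(frac=test_fraction, random_state=seed)
    retrieval_corpus = df.drop(eval_set.index)
    return retrieval_corpus, eval_set
```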

Langchain is utilized throughout the process for chunking, prompt templating, and LLM text generation. Combined with zero-shot inferences and the four retrieval augmentation strategies, a total of 40 reference-based metrics experiments are performed on an AWS virtual machine equipped with 4 cores, a 24 GB GPU, and 16 GB of RAM.

4.4. Supervised Fine Tuning vs. Retrieval Strategies

As previous code vulnerability studies such as that by Jiang et al. [53] have demonstrated, while supervised fine-tuning can lead to performance gains, it also incurs substantial computational and data-related costs. They use LoRA (Low-Rank Adaptation) for lightweight fine-tuning on LLMs such as Llama3-8B and CodeLlama-7B by freezing most of the model parameters and updating only a small subset. Due to resource constraints, they also excluded vulnerable code sequences longer than 1024 tokens, yet the fine-tuning process still required a 24-core processor and a 48 GB GPU.

In contrast, our retrieval strategies approach focuses on the data rather than fine-tuning. We conduct our experiments in a less computationally expensive environment, utilizing only a 4-core processor, a 24 GB GPU, and 16 GB of RAM. Additionally, we support longer vulnerable code sequences exceeding 1024 tokens, as our method does not involve fine-tuning, thereby avoiding the computational overhead and training instability associated with handling exceptionally long sequences in fine-tuned models.

By leveraging structured external knowledge of the vulnerabilities, we enrich the LLM’s context at inference time to improve its code repair recommendations without incurring the substantial computational demands of supervised fine-tuning. This strategy has shown promising results in recent work, such as Du et al. [54], who introduced the Vul-RAG framework, outperforming fine-tuned baselines in vulnerability detection tasks by dynamically retrieving and injecting relevant vulnerability knowledge from a constructed vulnerability knowledge base.

5. Identifying Limitations of Metrics-Based Assessment

The purpose of this assessment is to measure the quality of the repaired code generated as a result of the recommendations. One direct usage of the recommendation is to serve as an input for downstream LLMs to generate relevant and accurate repaired code. The hypothesis is that a relevant and precise recommendation should result in more alignment of the repaired code with the ground truth, thus resulting in higher assessment scores. Metrics such as CodeBLEU, chrF, and RUBY measure the syntactic correctness and structural similarity of the generated repairs compared to the ground truth. The experiment results for the reference-based metrics assessment are shown in Table 1.

However, the assessment using these metrics may be inadequate, as multiple valid but syntactically different solutions may exist. For instance, a null pointer issue may be fixed by either adding an inline if condition to perform the null check or by creating a reusable function to handle null checks. To better understand these shortcomings, we conducted a manual inspection of the evaluation results to identify cases where the metrics fail to accurately reflect the true repair quality.

Summary: The top-performing retrieval strategy, metadata-driven retrieval, leads to higher metrics scores of recommendations across four LLMs. The scores of samples in CVEFixes are generally lower than those of the BigVul samples, but still within the range of SOTA benchmarking results in Zhou et al. [5]. CVEFixes has a mix of code from about 30 languages, whereas BigVul only contains C/C++ vulnerable code. The diverse syntax is a factor that influences LLMs in generating recommendations. The results highlight that retrieval strategies contribute to the quality of the recommendation of code repair.

5.1. Manual Inspection for Metrics Limitation

We perform a manual inspection of the evaluation results by randomly selecting 450 samples from each dataset, resulting in 900 inspection samples. Each sample requires approximately 10 min to review across the four retrieval strategies, accumulating a total inspection time of around 20 days. Since the purpose of manual inspection is to expose assessment limitations, the current sample size satisfies this objective. Three reference scenarios are outlined to test the adequacy of these metrics’ assessment.

5.1.1. Accurate Retrieval with a Relevant Repair

In this scenario, all four retrieval strategies accurately retrieve the correct vulnerability data. The LLM proposes a relevant repair based on the retrieved context and the vulnerable code. This scenario covers 43.2% of the inspection samples. CodeBLEU, chrF, and RUBY tend to perform well here across all retrieval strategies. However, a repair could potentially receive a lower score despite being a viable solution due to its syntactical difference from the ground truth.

5.1.2. Inaccurate Retrieval with an Irrelevant Repair

The less accurate retrieval strategies, vanilla and metadata embedding, often retrieve an inaccurate vulnerability context and suggest an irrelevant recommendation. In one instance, the vanilla strategy retrieves data for CWE-419 (Unprotected Primary Channel) instead of CWE-416 (Use After Free). Therefore, the generated recommendation addresses CWE-419 rather than CWE-416. However, CodeBLEU, chrF, and RUBY produce scores of 0.84, 0.90, and 0.64, respectively, which are higher than the scores of 0.69, 0.81, and 0.53 derived from the accurate metadata-driven retrieval strategy. In 45.3% of the examined samples, at least one of the vanilla and metadata embedding strategies produces an inaccurate retrieval context. We observe that CodeBLEU, chrF, and RUBY cannot distinguish the irrelevant repairs resulting from these incorrect recommendations from the relevant repairs produced by the accurate segmented context and metadata-driven retrieval strategies.

5.1.3. Inaccurate Retrieval with No Repair

Given an inaccurate retrieval context, the LLM does not propose a recommendation, since the vulnerability in the context cannot be resolved with the retrieved data. This results in a generated recommendation without actionable steps and thus no modification to the vulnerable code. This is problematic because there are cases where only single-line or few-line modifications are documented in the ground truth fixes. Even without any repair to the code, the output generated by the LLM can still have high syntactic similarity to the ground truth. These metrics therefore still assign very high scores to code snippets generated from recommendations originating from the vanilla and metadata embedding strategies, despite their lower retrieval accuracy. The more accurate retrieval strategies retrieve the correct data in most cases, with good metric scores. In 11.5% of the examined samples, at least one of the four retrieval strategies falls into this scenario.

Summary: The three reference scenarios under manual inspection have revealed that reference-based metrics assessments are inadequate to faithfully assign scores to code repairs resulting from incorrect or low-quality recommendations given irrelevant retrieval context. This raises a question: can LLMs approximate the manual inspection process, capturing the relevance and correctness of these recommendations in alignment with human decisions?

5.2. Assessment Using Static Code Analysis Tools

We use static code analysis to measure whether LLM-generated repairs sufficiently mitigate code vulnerabilities. Reference-based metrics such as CodeBLEU, chrF, and RUBY evaluate textual similarity but cannot confirm whether a repair eliminates the security flaw or vulnerability. We utilize static code analysis as a security-centric validation method, serving purely as an evaluation-time vulnerability detection step to assess vulnerabilities before and after code repairs. We select the CVEFixes and BigVul datasets, two common code vulnerability datasets used extensively in code repair studies [22,28].

Prior work by Pearce et al. [23] uses CodeQL [57] on their own synthetically generated dataset to verify whether vulnerabilities are present in the snippets before applying repairs and whether vulnerabilities are resolved afterward. However, for languages such as C/C++, CodeQL requires error-free compilation, which poses a challenge for datasets such as BigVul and CVEFixes. These two datasets contain isolated C/C++ snippets with missing dependencies, variables, and functions, and attempts to bypass compilation lead to unreliable static analysis outcomes. Instead, we use Snyk [58], a static code analysis tool that does not strictly require full code compilation. We observe that Snyk outperforms CodeQL and SonarQube [59] in vulnerability detection on the datasets used in this paper. Hence, the role of Snyk in our experiments is to perform pre- and post-repair vulnerability detection solely for evaluation purposes to measure the repair success rate, establishing a baseline for assessing LLM-generated vulnerability repairs.

We acknowledge that while static analysis allows us to estimate repair success on incomplete code snippets, it does not guarantee functional correctness. A better approach would be to compile or run tests on the code snippets. However, compiling and running test cases to ensure functional correctness is not feasible in our setting, as the datasets provide only partial code fragments rather than complete, executable programs. Consequently, existing SOTA works leveraging these datasets, such as VulMaster [5] and VulRepair [8], also rely solely on automated metrics, such as CodeBLEU or Exact Match for evaluation.

To evaluate our approach, we conduct static analysis on C/C++ samples from both the BigVul and CVEFixes datasets. For each vulnerable snippet, we generate a corresponding C/C++ file that is then scanned for vulnerabilities. The detected vulnerabilities are fed to both a zero-shot LLM and a retrieval-augmented LLM to generate repairs, which are subsequently re-analyzed to measure the repair rate. Our experiments compare the best-performing model, GPT-4o in a zero-shot configuration, against the most effective retrieval-based method, the metadata-driven retrieval strategy. Table 2 shows the results of the static analysis.
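
The sketch below outlines how the pre- and post-repair measurement could be automated. It assumes the Snyk CLI's `snyk code test --json` invocation; `parse_issue_count` is a hypothetical helper because the JSON schema is not reproduced here, and the rate definitions are our reading of the file-level and issue-level figures reported in the summary.

```python
# Sketch of pre-/post-repair vulnerability scanning and repair-rate computation.
import json
import subprocess

def scan_issue_count(path: str) -> int:
    # Run Snyk Code static analysis on a single file or directory and count reported issues.
    result = subprocess.run(["snyk", "code", "test", path, "--json"],
                            capture_output=True, text=True)
    return parse_issue_count(json.loads(result.stdout))  # hypothetical schema-specific parsing

def repair_rates(pre_counts: dict[str, int], post_counts: dict[str, int]) -> tuple[float, float]:
    # File-level rate: fraction of initially vulnerable files with zero remaining issues.
    vulnerable = [f for f, n in pre_counts.items() if n > 0]
    files_fixed = sum(1 for f in vulnerable if post_counts.get(f, 0) == 0) / len(vulnerable)
    # Issue-level rate: fraction of initially detected issues no longer reported after repair.
    issues_resolved = 1 - sum(post_counts.values()) / sum(pre_counts.values())
    return files_fixed, issues_resolved
```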

Summary: The findings reveal improvements in vulnerability mitigation when using the metadata-driven retrieval strategy over the zero-shot approach. Before repair, Snyk detected 459 security issues across 365 files in BigVul and 4145 issues across 823 files in CVEFixes. Some files contained multiple vulnerabilities, with some cases exhibiting up to 19 distinct security flaws, such as path traversal, buffer overflow, and integer overflow. Post-repair analysis shows that retrieval-based approaches improve fix success rates considerably compared to the zero-shot approach. Using the metadata-driven retrieval strategy, the repair success rate for the BigVul dataset increases from 28% to 52.6%, with resolved issues rising from 33% to 51.2%. With CVEFixes, we see improvements from 58.4% to 85.3% in files fixed and from 73% to 92.3% in issues resolved. These results highlight the practical benefits of retrieval-augmented LLMs with efficient retrieval strategies for mitigating code vulnerabilities.

5.3. Comparison with SOTA

We further compare our best-performing retrieval strategy, the metadata-driven retrieval (MDR) strategy, combined with our selected models against VulMaster [5] and the baselines used in their work. VulMaster reports results on CVEFixes and BigVul using Exact Match (EM) and CodeBLEU. Table 3 shows the results of the comparison.

VulMaster achieves the highest EM score, indicating it is better at producing exact reference patches from the benchmark. However, EM only rewards verbatim matches and does not consider similar or alternative fixes that address a given vulnerability. In practice, multiple correct repairs may exist for the same flaw (e.g., equivalent guard placement, reordered control flow). Such fixes may be relevant and effective, yet EM will score them as incorrect simply because they are not identical to the reference repair. Metrics such as CodeBLEU, chrF, and RUBY [29] consider factors such as character n-gram overlap, token overlap with syntax-aware components, and stylistic or quality aspects specific to code evaluation. They are therefore less strict than EM, which explains the higher scores our MDR+LLM systems achieve on these measures. At the same time, high CodeBLEU, chrF, and RUBY scores do not guarantee that the vulnerability is repaired. As we show in Section 5, these metrics can assign high scores to irrelevant or no-change outputs. Likewise, a high EM suggests exactness on the benchmark patch but does not consider alternative correct repairs. Motivated by these gaps, we explore the use of an LLM-as-a-Judge framework in Section 6 to assess relevance, completeness, correctness, identification of vulnerable code, and code guidance, aligning evaluation more closely with real repair quality.

6. LLM-Based Assessment

As shown in existing SOTA works [34,35,36], LLMs have shown promise as evaluators for various tasks, being able to understand context, process criteria, reason, and assign scores in a manner that mimics human evaluators. We adopt the LLM-as-a-Judge approach following prior studies showing that LLMs can reliably evaluate generated outputs when guided by structured criteria or rubrics. Following this approach, in this section we define the five evaluation criteria used by the LLM to evaluate code vulnerability repair recommendations. We compare the LLMs’ assessments against manual inspection results and evaluate their reliability by checking the consistency between two independent judges (GPT-4o and Gemini).

6.1. Evaluation Criteria and LLM Metrics for Code Repair Recommendations

This section details the criteria employed to evaluate the quality and utility of code repair recommendations generated by the LLM using a carefully selected set of metrics: Relevance, Completeness, Correctness, Identification of Vulnerable Code, and Code Guidance. These metrics are chosen to complement reference-based metrics by focusing on human-aligned qualitative aspects of repair quality that automated metrics often overlook. Each criterion was motivated by prior LLM and program repair studies that highlight the importance of these dimensions for evaluating code vulnerability repair.

6.1.1. Relevance

Relevance captures whether the recommended fix directly addresses the specified vulnerability in the given context. This means that the recommendation must be applicable to the problem rather than being a generic suggestion. Peng et al. [65] highlight the importance of relevance by noting that a patch generated by an LLM must match the contextual needs of the vulnerability rather than merely fixing symptoms. Industry guidelines for evaluating LLM responses include relevance as a criterion, defined as how directly or accurately an LLM answers a query [66,67,68].

6.1.2. Completeness

Completeness measures how well a model’s response covers all the relevant information provided in the context [69,70]. Our rubric for completeness ensures that the recommendation addresses all aspects of the vulnerability without leaving exploitable gaps. In software vulnerability repair, a fix that only covers part of the vulnerability may still leave the system open to attack [71,72].

6.1.3. Correctness

Correctness measures whether an LLM output is factually accurate based on the ground truth [66,68,73]. Unlike other classification tasks with unique ground truth, code repair can have multiple valid fixes. In this case, correctness addresses the technical and functional accuracy of the recommendation based on the provided context. Prior work on automatic program repair [74] shows that even patches that pass automated tests can sometimes be semantically incorrect. Our approach for evaluating the correctness of the recommendation is based on human judgment criteria that look beyond semantic similarity and assess whether the logic for repairing the vulnerability is sound and functionally appropriate, given the vulnerability context.

6.1.4. Identification of Vulnerable Code

This metric was incorporated to assess the model’s ability to localize the vulnerable code snippet based on the provided context. This metric is motivated by findings that even if a fix is imperfect, correctly localizing the vulnerable code is valuable [74]. In the context of code repair, prior works also evaluate whether a model can locate the vulnerable portion of the code when suggesting a fix [75]. Hence, this metric assesses whether the recommendation correctly highlights the vulnerable portions of the code.

6.1.5. Code Guidance

In addition to the code repair itself, we expect the recommendation to provide detailed code guidance that explains the changes and the rationale behind them. Such guidance benefits downstream tasks and developers who consume the repair recommendation. Research on code critique by LLMs also includes dimensions such as the clarity and practicality of the suggested changes [76].

Collectively, these criteria provide a qualitative assessment for the LLM-generated code repair recommendations, aligned with human-centric evaluation methods. Table 4 describes the details of the scoring rubric used to quantify the performance of the LLM output against each criterion discussed above.

6.2. The Pipeline of LLMs Assessment

First, we develop the task prompt for each criterion and then use chain-of-thought prompting to generate the evaluation steps, as shown in stage three (LLM-based assessment) in Figure 1. The task prompt informs the LLM of the goal. For example, the task prompt for relevance is presented below:

You will be given a recommendation written for repairing a code vulnerability. Your task is to evaluate the recommendation on Relevance, ensuring that the solution is appropriate for the context of the vulnerability.

We combine the task prompt with the evaluation criteria in Table 4 to form the final prompt from which the LLM generates the evaluation steps by itself. We prompt the LLM to produce a series of logical steps that break down the entire assessment process, thereby creating a chain of thought for the LLM-as-a-Judge to follow when performing the assessment. For example, given the task definition for the relevance criterion, the LLM produces the following steps:

Determine whether the recommendation directly addresses the vulnerability described in the context.

Assess whether the recommendation is practical and relevant for the specific issue identified.

Identify any irrelevant or generic information that makes the recommendation less applicable to the specific context of the vulnerability.

Provide a clear method for scoring the recommendation based on its relevance and applicability using a scale from 1 to 5.

Penalize the recommendation if the CVE and CWE do not match or if the suggestion is generic, irrelevant, or impractical.

We then supply the evaluation steps together with the vulnerable code and the retrieved context to the LLM-as-a-judge to perform the assessment and assign scores.
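
To make this assembly concrete, the sketch below shows one way the task prompt, the generated evaluation steps, and the retrieved context could be combined into a single judge prompt. It is a minimal illustration in Python; the function name, field labels, and exact wording are our assumptions, not the prompts used in the paper.

```python
def build_judge_prompt(task_prompt: str, evaluation_steps: list[str],
                       vulnerable_code: str, retrieved_context: str,
                       recommendation: str) -> str:
    """Assemble a single prompt for the LLM-as-a-Judge (illustrative only)."""
    # Number the chain-of-thought evaluation steps generated earlier.
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(evaluation_steps))
    return (
        f"{task_prompt}\n\n"
        f"Evaluation steps:\n{steps}\n\n"
        f"Vulnerability context:\n{retrieved_context}\n\n"
        f"Vulnerable code:\n{vulnerable_code}\n\n"
        f"Recommendation to evaluate:\n{recommendation}\n\n"
        "Assign a score from 1 to 5 for this criterion."
    )
```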

6.3. Scoring the Recommendations

The LLM assesses each recommendation against the rubrics defined in Table 4 and assigns a score. We adopt the scoring mechanism proposed by Microsoft in the G-Eval framework [36]. Instead of taking the integer value output by the LLM as the final score, the probabilities of the candidate score tokens (1 through 5) are used as weights, and the weighted sum is normalized to yield a continuous score between 0 and 1. This prevents low-variance results in which a single digit dominates the score distribution and provides a more fine-grained assessment.
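
The following sketch illustrates this probability-weighted scoring. It assumes access to the log-probabilities of the candidate score tokens returned by the judge LLM; the helper name and the min-max normalization of the 1–5 scale to [0, 1] are our assumptions about the mapping rather than the exact implementation.

```python
import math

def weighted_judge_score(score_logprobs: dict[int, float]) -> float:
    """G-Eval-style scoring: weight the candidate scores 1-5 by their token
    probabilities and normalize the expected score to the range [0, 1]."""
    # Softmax over the candidate score tokens so the weights sum to 1.
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    weights = {s: p / total for s, p in probs.items()}

    # Probability-weighted expected score on the original 1-5 scale.
    expected = sum(s * w for s, w in weights.items())

    # Min-max normalization of the 1-5 scale to [0, 1] (our assumption).
    return (expected - 1.0) / 4.0

# Example: the judge is most confident in "4", with some mass on "3" and "5".
print(round(weighted_judge_score({1: -9.2, 2: -6.5, 3: -2.1, 4: -0.4, 5: -1.8}), 3))
```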

6.4. Assessment Results by LLM-as-a-Judge

We apply two LLM judges, GPT-4o and Google Gemini 2.5 Flash, to assess the generated recommendations across both datasets. The purpose is to observe whether LLM-based assessments assign scores that faithfully reflect the quality of the retrieval contexts and recommendations. The use of GPT-4o and Gemini as evaluators follows prior work showing that LLMs can assess generated outputs when guided by prompts and structured rubrics [33,36]. Using the LLM judges, we also compare the quality of responses from our selected models under each retrieval strategy with VulMaster [5]. The assessment comprehensively combines two LLM judges, two datasets, four recommendation LLMs, and five generation settings (zero-shot inference plus the four retrieval strategies). Figure 12 plots the performance of the selected models across all LLM metrics and how they compare to VulMaster.

The results show that both the GPT-4o and Gemini judges consistently assign lower scores to recommendations generated by the zero-shot, vanilla, and metadata embedding strategies across all five criteria. Both judges find that the metadata-driven strategy produces recommendations with the highest scores under the majority of the criteria, and both agree that the best performance is obtained when GPT-4o uses the metadata-driven strategy to enrich its context. We observe in Table 3 that VulMaster stands out on Exact Match (EM) benchmarks, reflecting its ability to reproduce reference patches exactly. However, when assessed with our LLM-based metrics, which focus on relevance, completeness, correctness, identification of vulnerable code, and code guidance, its performance is noticeably weaker, falling below MDR-augmented LLMs on nearly all criteria. Figure 13 compares a sample response from GPT-4o using the metadata-driven retrieval strategy with VulMaster’s output. While VulMaster presents only the patch, GPT-4o identifies the issue, offers a recommendation, and provides a valid fix that explains how to address the vulnerability. This highlights how our approach goes beyond benchmark patch replication to deliver explainable, CWE-aligned fixes that better support downstream automated vulnerability repair systems.

Answer to RQ1: The design and choice of the retrieval strategies are based on the characteristics of the code vulnerability datasets, such as token overlap and metadata uniqueness. Our evaluation shows that, for our use case, these factors can hinder naive retrieval approaches but are better addressed when retrieval is guided by metadata filtering. Among the strategies we examined, the metadata-driven retrieval strategy, which exploits such filtering, consistently provides the most accurate retrieval contexts and yields the highest recommendation quality across our LLM-based metrics. These findings were validated through both manual inspection and LLM-based assessments.

We further examined performance across programming languages using GPT-4o and the metadata-driven retrieval strategy. As shown in Figure 14, the distributions of judge scores are closely aligned with overlapping ranges, and no language shows consistent underperformance. These results indicate that the framework operates in a language-agnostic manner rather than favoring particular programming languages.

Answer to RQ2: The results confirm that our code vulnerability recommendation framework operates in a language-agnostic manner. As shown in Figure 14, the LLM judge scores show that no language consistently overperforms or underperforms. Recommendation quality remains consistent, highlighting the framework’s applicability across diverse programming languages.

6.5. Detecting Self-Alignment Bias of GPT-4o

Self-alignment bias occurs when a model rates its own outputs more favorably than outputs from other models [38]. Scrutinizing the results further, we note that the GPT-4o judge tends to assign higher scores on each criterion than the peer judge Gemini does. Moreover, in evaluations where GPT-4o serves as both the model and the judge, the GPT-4o judge gives the GPT-4o model consistently leading scores across all criteria and retrieval strategies, with pronounced margins over models such as Llama-3 8B Instruct and Mixtral 8 × 7B Instruct. This raises the question of whether an LLM judge exhibits self-alignment bias, preferring its own recommendations.

We investigate self-alignment bias by introducing perturbations to GPT-4o’s recommendations using Llama-3 8B Instruct. The perturbations rephrase words or tokens to generate alternative versions of a recommendation; these minor changes do not alter the recommendation’s core semantic meaning or the validity of the code repair. The aim is to determine whether the GPT-4o judge prefers its original recommendations over the versions minimally perturbed by Llama-3 8B Instruct. The hypothesis is that the evaluation scores assigned by the GPT-4o judge should remain consistent if the factual content is approximately equivalent.
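
A minimal sketch of the perturbation step is given below. The prompt wording and the `llama_generate` callable are hypothetical placeholders for an inference call to Llama-3 8B Instruct; the only requirement is that the rephrasing preserves the technical content of the original recommendation.

```python
PERTURBATION_PROMPT = (
    "Rephrase the following code repair recommendation. You may change wording "
    "and sentence structure, but do not alter its technical meaning, the "
    "identified vulnerability, or the suggested code changes.\n\n"
    "Recommendation:\n{recommendation}"
)

def perturb(recommendation: str, llama_generate) -> str:
    """Return a semantically equivalent rephrasing of a GPT-4o recommendation.
    `llama_generate` is a placeholder for a call to Llama-3 8B Instruct."""
    return llama_generate(PERTURBATION_PROMPT.format(recommendation=recommendation))
```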

We compute the difference in scores between two sets of recommendations: the original recommendations from GPT-4o and their perturbed versions produced by Llama-3 8B Instruct. The difference is measured for each criterion using Cohen’s d [77], as shown in Equation (1):

d = \frac{\bar{X}_{\text{original}} - \bar{X}_{\text{perturbed}}}{s_p}    (1)

where

\bar{X}_{\text{original}} is the mean score assigned by GPT-4o to its own (original) recommendation.

\bar{X}_{\text{perturbed}} is the mean score for the perturbed versions of the recommendation.

s_p is the pooled standard deviation of the scores.

In this paper, the effect size of Cohen’s d corresponds to the degree of self-alignment bias exhibited by the GPT-4o judge. A negligible effect size (d < 0.2) indicates no meaningful bias. A small effect size (0.2 ≤ d < 0.5) suggests a slight preference of GPT-4o for its own outputs. A medium effect size (0.5 ≤ d < 0.8) indicates a noticeable bias, and a large effect size (d ≥ 0.8) implies that GPT-4o strongly prefers its original recommendations. The results are listed in Table 5.
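
The sketch below computes Cohen’s d with the pooled standard deviation of Equation (1) and maps the value to the effect-size bands above. The judge scores used in the example are illustrative values, not results from Table 5.

```python
import statistics

def cohens_d(original: list[float], perturbed: list[float]) -> float:
    """Cohen's d between two score sets using the pooled standard deviation."""
    n1, n2 = len(original), len(perturbed)
    s1, s2 = statistics.stdev(original), statistics.stdev(perturbed)
    pooled = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(original) - statistics.mean(perturbed)) / pooled

def effect_band(d: float) -> str:
    """Interpretation used in this paper; negative values indicate no self-preference."""
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

# Illustrative judge scores for one criterion (original vs. perturbed versions).
original = [0.82, 0.78, 0.85, 0.80, 0.79]
perturbed = [0.81, 0.76, 0.84, 0.80, 0.77]
d = cohens_d(original, perturbed)
print(round(d, 2), effect_band(d))  # 0.4 small
```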

The results show that GPT-4o generally favors its own recommendations over the versions perturbed by Llama-3 8B. The Cohen’s d values predominantly range from small to medium effect sizes, highlighting a mild but consistent bias toward its own outputs. A few negative or negligible Cohen’s d values indicate that GPT-4o does not universally favor its own recommendations.

Answer to RQ3: We observe that reference-based metrics such as CodeBLEU, chrF, and RUBY often assign high scores even to irrelevant or incomplete repairs generated from the LLM recommendations, highlighting their limitations for assessing recommendation quality: they fail to capture contextual accuracy and functional correctness. By contrast, the LLM-as-a-Judge framework produces scores that align more closely with human judgment, emphasizing qualitative aspects such as relevance, completeness, and correctness that are not captured by SOTA reference-based metrics. However, the scores also show evidence of self-alignment bias when an LLM evaluates its own recommendations. This highlights the necessity of introducing an independent evaluator to validate results when LLMs take on multifaceted roles in code vulnerability recommendation and assessment.

7. Guidelines for Reproducible and Extensible Adoption

Open access. Our approach leverages open-source data, models, and frameworks, ensuring reproducibility for diverse applications in software vulnerability analysis. Datasets such as CVEFixes and BigVul, together with knowledge bases such as the CWE data from MITRE, form the foundation of this study. Reproducibility can be achieved by incorporating similar publicly available datasets.

Data characteristic driven. Among the retrieval strategies we explore, the metadata-driven retrieval strategy is the most effective at reducing token redundancy in embeddings and improving retrieval accuracy. These retrieval strategies are driven by the characteristics of the dataset; they are independent of the embedding models and can be applied to any vector database used for embedding storage. While the metadata-driven strategy performs best for our use case, its effectiveness depends on the availability of structured metadata, so it may not generalize well to datasets lacking such structure. For datasets with different characteristics, new retrieval strategies can be designed and evaluated following the same data-centric approach; once a new strategy is defined for the data at hand, the three-stage pipeline framework is extensible enough to accommodate the new data types and associated retrieval strategies.
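
As a concrete illustration of the metadata-filtering idea, the sketch below first restricts the knowledge base to entries whose structured metadata (e.g., CWE and CVE identifiers) matches the query and only then ranks the survivors by embedding similarity. The record schema and field names are our assumptions for illustration, not the paper’s actual storage layout, and the function is independent of any particular vector database.

```python
import numpy as np

def metadata_driven_retrieve(query_embedding, records, cwe_id, cve_id=None, k=3):
    """Filter knowledge-base records by metadata, then rank by cosine similarity.
    `records` is assumed to be a list of dicts with 'embedding', 'cwe_id',
    'cve_id', and 'text' fields (illustrative schema)."""
    candidates = [
        r for r in records
        if r["cwe_id"] == cwe_id and (cve_id is None or r["cve_id"] == cve_id)
    ]

    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(candidates,
                    key=lambda r: cosine(query_embedding, r["embedding"]),
                    reverse=True)
    return ranked[:k]
```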

Model agnostic. The framework is model-agnostic and integrates seamlessly with both open-source and proprietary LLMs. We demonstrate its adoption with GPT-4o, Llama-3 8B Instruct, CodeLlama-34B, and Mixtral 8 × 7B Instruct. Other models, such as CodeT5, can also replace any of the LLMs used in this paper without changing the three-stage pipeline.

Alignment with human inspection. We advocate cross-validating each stage of the recommendation and code vulnerability repair pipeline with SOTA metrics, human inspection of samples, and the LLM-as-a-Judge approach.

Extensibility. Our three-stage pipeline framework is designed to be extensible. When data types with different characteristics or retrieval challenges are identified, new retrieval strategies can be incorporated seamlessly into this framework. This extensibility ensures that our approach can adapt to varying datasets and incorporate other retrieval strategies tailored to specific data characteristics.

Integration into developer workflows. Our framework can be integrated into existing vulnerability detection tools and complement them by providing repair recommendations once vulnerabilities are identified. For example, when vulnerabilities are detected in an IDE or during CI/CD pipeline scans, the detected CWE and CVE IDs, along with the vulnerable code snippets, can be passed into our system. The framework then retrieves relevant vulnerability information and historical examples to generate a repair recommendation, making it suitable for incorporation into existing CI/CD pipelines or developer-facing tools to assist in vulnerability repair.
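
A minimal sketch of such an integration point is shown below. The `retrieve` and `generate` callables stand in for the framework’s retrieval and recommendation stages, and the function name and prompt wording are our assumptions rather than the framework’s actual API; a CI/CD scanner or IDE plugin would call it with the detected CWE/CVE IDs and the flagged snippet.

```python
def recommend_repair(cwe_id: str, cve_id: str, vulnerable_snippet: str,
                     retrieve, generate) -> str:
    """Hypothetical glue code between a vulnerability scanner and the framework."""
    # Retrieve vulnerability context keyed by the detected identifiers.
    context = retrieve(cwe_id=cwe_id, cve_id=cve_id, code=vulnerable_snippet)

    # Ask the recommendation LLM for an explainable repair.
    prompt = (
        f"Vulnerability context:\n{context}\n\n"
        f"Vulnerable code:\n{vulnerable_snippet}\n\n"
        "Provide a repair recommendation that identifies the issue, explains "
        "the rationale, and includes the fixed code."
    )
    return generate(prompt)
```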

8. Threats to Validity

To address threats to external validity, we focus on datasets that are widely used in the vulnerability research community. These datasets provide a solid basis for evaluating retrieval and recommendation performance. However, we acknowledge that relying on only two datasets may limit the generalizability of our findings to other vulnerability types or domains. Future work could extend our approach to additional datasets to validate the results further.

An internal threat to validity is the potential bias in the results when GPT-4o is used as both a recommendation model and a judge. We check for bias using Cohen’s d and find that the bias ranges from negligible to medium. To address this concern and verify GPT-4o’s assessment results, we cross-validate the results with Gemini as an independent judge and find that their assessments largely align, verifying our earlier findings with GPT-4o. Our objective in this work is to measure and acknowledge the possible presence of such bias, not to develop or evaluate mitigation strategies. Addressing this bias would require a dedicated experimental design and is therefore beyond the scope of this study. Future work could address the issue more directly by fully separating the roles of generation and evaluation.

Although our CVE-based splitting and duplicate removal prevent identical instances from appearing in both training and test sets, residual contamination remains possible. As noted by Matton et al. [52], LLMs may still recall evaluation examples encountered during pretraining if those examples exist in public sources such as GitHub, from which BigVul and CVEFixes were derived. This constitutes a form of direct data leakage that our methodology cannot fully eliminate without full transparency in the model’s original training data.

Another threat concerns the manual inspection we conducted to categorize the evaluation scenarios, for which we selected 900 samples from the entire combined dataset. Our goal was mainly to identify the existence of inadequate metric-based assessments for code repair recommendations.

Given the stochastic nature of LLM outputs, the reliability of our findings depends on the consistency of the recommendations across multiple runs. To mitigate this, we conducted multiple trials for each retrieval strategy and report the average performance to account for potential variability.

9. Conclusions and Future Work

This paper presents multifaceted LLMs in a comprehensive framework for accurate and relevant recommendations on vulnerable code repair. We compare four retrieval strategies with the aim of improving the accuracy of the retrieval context, and the metadata-driven retrieval strategy emerges as the best-performing design for our use case. Our assessment of the code recommendation generation pipeline leverages reference-based metrics, static code analysis, and an LLM-as-a-Judge approach, applying cross-validation with multiple evaluators to ensure consistency, thereby establishing a framework for assessing recommendation quality. Future work will explore graph retrieval and augmentation strategies to capture hierarchical relationships between CWEs and CVEs, enabling richer context retrieval and potentially enhancing performance on less frequent CWE types. Further refinements to our criteria and evaluation methods, as well as testing on broader vulnerability datasets, will also extend the applicability of this approach across diverse software security contexts.

Author Contributions

Conceptualization, A.A.A. and Y.L.; methodology, A.A.A. and Y.L.; software, A.A.A.; validation, A.A.A. and Y.L.; formal analysis, A.A.A. and Y.L.; investigation, A.A.A. and Y.L.; resources, Y.L.; data curation, A.A.A.; writing—original draft preparation, A.A.A.; writing—review and editing, A.A.A. and Y.L.; visualization, A.A.A.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The datasets used in this study are BigVul [18] and CVEFixes [19], which are both open source. The code and the processed data samples are available on GitHub: https://github.com/alfredasare/llm-for-code-recommendation (accessed on 15 October 2025).

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 The pipeline framework of code vulnerability repair recommendations using four distinct retrieval strategies. The pipeline integrates LLMs with retrieved information about the vulnerability to provide a recommendation, followed by an evaluation phase using static code analysis, metrics (such as CodeBLEU, chrF, and RUBY), and an LLM-as-a-Judge framework to assess recommendation quality.

Figure 2 Cosine similarity scores of 10 CVEs belonging to CWE-189.

Figure 3 Language distribution in the CVEFixes dataset.

Figure 4 Workflow showing how the Vanilla strategy handles data processing and retrieval.

Figure 5 Workflow showing how the Metadata Embedding Strategy handles data processing and retrieval.

Figure 6 Workflow showing how the Segmented Context Strategy handles data processing and retrieval.

Figure 7 Workflow showing how the Metadata-Driven Retrieval Strategy handles data processing and retrieval.

Figure 8 F1 score of each retrieval strategy across knowledge bases (VS: Vanilla; MES: Metadata Embedding; SC: Segmented Context; MDR: Metadata-Driven Retrieval).

Figure 9 Token length distribution in the CVEFixes and BigVul datasets. The token length breakpoints 8K, 16K, 32K, and 128K correspond to the token lengths of our selected models: Llama-3 8B Instruct, CodeLlama-34B, Mixtral 8 × 7B Instruct, and GPT-4o, respectively.

Figure 10 Large code snippets are recursively split into chunks that are processed in parallel by an LLM agent with LLM-powered tool support to find vulnerable segments using the retrieved context as guidance. The vulnerable snippets, with the retrieved context, are then aggregated to generate a final repair recommendation.

Figure 11 Baseline establishment using static code analysis and metrics scores for the LLM in zero-shot setting. Segment 1 shows the use of static code analysis to measure the vulnerability repair rate. Segment 2 measures the CodeBLEU, chrF, and RUBY scores of the code generated by the model using the recommendations, and Segment 3 shows the LLM-as-a-judge scores for the recommendations generated by the model in zero-shot. These set up baseline scores with which the retrieval strategy scores are compared.

Figure 12 LLM-as-a-Judge evaluation of recommendation quality across models and metrics. Each plot shows performance across different retrieval contexts. The red horizontal band indicates VulMaster’s reported score range. The results show that VulMaster consistently underperforms LLM-based recommendations with the metadata-driven retrieval strategy context augmented on these qualitative criteria.

Figure 13 Comparison between our metadata-driven GPT-4o recommendation (left) and VulMaster’s fix (right). Our approach highlights the issue, provides a recommendation, and shows the corresponding fix, making it explainable and suitable for downstream tasks such as automated code vulnerability fixes. In contrast, VulMaster outputs only the patch lines, offering no rationale or guidance, thus lacking explainability.

Figure 14 GPT-4o recommendation quality across programming languages using the LLM-as-a-Judge metrics. Overlapping ranges and similar medians indicate consistent performance independent of programming language.

Table 1 Reference-based metrics comparison of recommendations based on the four retrieval augmentation strategies, using CodeBLEU (token overlap and syntax [12]), chrF (character n-gram overlap [55]), and RUBY (stylistic and quality aspects [56]).

Dataset Strategy GPT-4o (CodeBLEU / chrF / RUBY) Llama-3 8B (CodeBLEU / chrF / RUBY) Mixtral 8 × 7B (CodeBLEU / chrF / RUBY) CodeLlama 34B (CodeBLEU / chrF / RUBY)
BigVul Zero-shot 0.62 0.77 0.53 0.57 0.67 0.52 0.50 0.65 0.45 0.52 0.54 0.36
Vanilla 0.67 0.82 0.50 0.62 0.72 0.49 0.53 0.69 0.45 0.55 0.60 0.35
Metadata Embedding 0.69 0.85 0.55 0.63 0.72 0.58 0.57 0.70 0.50 0.57 0.52 0.33
Segmented Context 0.66 0.81 0.49 0.68 0.81 0.55 0.54 0.72 0.46 0.55 0.62 0.37
Metadata-driven Retrieval 0.70 0.89 0.60 0.69 0.84 0.59 0.58 0.75 0.46 0.56 0.59 0.42
CVEFixes Zero-shot 0.29 0.33 0.26 0.26 0.27 0.31 0.24 0.22 0.24 0.31 0.34 0.28
Vanilla 0.31 0.40 0.32 0.31 0.29 0.30 0.29 0.25 0.31 0.33 0.41 0.32
Metadata Embedding 0.32 0.41 0.36 0.28 0.30 0.32 0.28 0.32 0.32 0.30 0.38 0.34
Segmented Context 0.39 0.39 0.32 0.35 0.33 0.31 0.29 0.38 0.31 0.31 0.40 0.30
Metadata-driven Retrieval 0.43 0.41 0.39 0.40 0.35 0.34 0.39 0.40 0.37 0.34 0.42 0.33

Table 2 Comparison of zero-shot and metadata-driven retrieval using GPT-4o.

Dataset Zero-Shot (Fixed Files / Resolved Issues) Metadata-Driven Retrieval (Fixed Files / Resolved Issues)
BigVul 102 (28%) 151 (33%) 192 (52.6%) 235 (51.2%)
CVEFixes 481 (58.4%) 2996 (73%) 702 (85.32%) 3827 (92.3%)

Table 3 Comparison of MDR-augmented LLMs with VulMaster and other baselines.

Model EM (%) CodeBLEU chrF RUBY
CodeLlama 34B + MDR 3.7 0.45 0.47 0.38
Mixtral 8 × 7B + MDR 4.3 0.51 0.55 0.42
Llama-3 8B + MDR 5.0 0.53 0.58 0.46
GPT-4o + MDR 10.2 0.57 0.65 0.49
CodeBERT [60] 7.3 0.22 N/A N/A
GraphCodeBERT [61] 8.1 0.17 N/A N/A
PolyCoder [62] 9.9 0.30 N/A N/A
CodeGen [63] 12.2 0.30 N/A N/A
Codereviewer [64] 10.2 0.38 N/A N/A
CodeT5 [51] 16.8 0.35 N/A N/A
VRepair [22] 8.9 0.32 N/A N/A
VulRepair [8] 16.8 0.35 N/A N/A
VulMaster [5] 20.0 0.41 N/A N/A

Table 4 Defined rubrics for evaluating the quality of vulnerability repair recommendations based on the five criteria.

Criteria Score 1 Score 2 Score 3 Score 4 Score 5
Relevance Irrelevant and not applicable to the identified issue. Partially relevant but has limited applicability. Generally relevant and mostly applicable, with some minor issues. Relevant and applicable, with minor improvements possible. Highly relevant, very applicable, and practical to implement.
Completeness Very incomplete, lacking thoroughness, and missing vital aspects. Partially complete, with significant gaps in thoroughness. Generally complete and adequately thorough, with some minor gaps. Complete and thorough, with minor improvements possible. Very complete, highly thorough, and comprehensive in its approach.
Correctness Very inaccurate, with multiple errors and unreliable information. Somewhat inaccurate, with several errors that impact its reliability. Generally accurate, with minor errors that do not significantly impact its reliability. Accurate, with minor errors, with most of the information being reliable. Highly accurate and free of errors, with all information being reliable.
Id. of Vulnerable Code Does not identify any specific vulnerable code. Vaguely identifies the vulnerable code with unclear details. Generally identifies the vulnerable code, with some clarity issues. Correctly identifies the vulnerable code, with explanations missing a few details. Specifically identifies and clearly explains the vulnerable code.
Code Guidance No relevant code snippets are provided in the recommendation. The provided code snippets are irrelevant or unhelpful. Includes some relevant snippets, but there are minor issues with clarity or usefulness. Provides relevant and helpful snippets, with minor improvements possible. Provides highly relevant and helpful snippets, offering exemplary guidance.

Table 5 Evaluation of GPT-4o bias across retrieval strategies and datasets. The results show that the bias ranges from negligible to medium in terms of Cohen’s d (REL: Relevance; COM: Completeness; COR: Correctness; IVC: Identification of Vulnerable Code; CG: Code Guidance).

Dataset Strategy REL COM COR IVC CG
BigVul VS 0.594 0.314 0.435 0.387 0.433
MES 0.641 0.497 0.515 0.553 0.668
SC 0.404 0.231 0.694 0.446 0.528
MDR 0.669 0.567 0.701 0.431 0.340
CVEFixes VS 0.510 0.585 0.373 0.222 0.294
MES 0.430 0.635 0.609 0.625 0.601
SC −0.218 0.383 0.031 −0.162 −0.010
MDR 0.317 0.530 0.465 0.678 0.603

References

1. NIST. NVD—Vulnerabilities. 2019; Available online: https://nvd.nist.gov/vuln (accessed on 8 August 2025).

2. Octoverse 2024: The State of Open Source. Available online: https://github.blog/news-insights/octoverse/octoverse-2024/ (accessed on 8 August 2025).

3. Homaei, H.; Shahriari, H.R. Seven Years of Software Vulnerabilities: The Ebb and Flow. IEEE Secur. Priv.; 2017; 15, pp. 58-65. [DOI: https://dx.doi.org/10.1109/MSP.2017.15]

4. Shahriar, H.; Zulkernine, M. Mitigating program security vulnerabilities: Approaches and challenges. ACM Comput. Surv.; 2012; 44, 11. [DOI: https://dx.doi.org/10.1145/2187671.2187673]

5. Zhou, X.; Kim, K.; Xu, B.; Han, D.; Lo, D. Out of Sight, Out of Mind: Better Automatic Vulnerability Repair by Broadening Input Ranges and Sources. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering; New York, NY, USA, 14–20 April 2024; [DOI: https://dx.doi.org/10.1145/3597503.3639222]

6. Zou, D.; Wang, S.; Xu, S.; Li, Z.; Jin, H. μ VulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection. IEEE Trans. Dependable Secur. Comput.; 2021; 18, pp. 2224-2236. [DOI: https://dx.doi.org/10.1109/TDSC.2019.2942930]

7. Jiang, N.; Lutellier, T.; Tan, L. CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. Proceedings of the 43rd International Conference on Software Engineering; Madrid, Spain, 22–30 May 2021; pp. 1161-1173. [DOI: https://dx.doi.org/10.1109/ICSE43902.2021.00107]

8. Fu, M.; Tantithamthavorn, C.; Le, T.; Nguyen, V.; Phung, D. VulRepair: A T5-based automated software vulnerability repair. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering; New York, NY, USA, 14–16 November 2022; pp. 935-947. [DOI: https://dx.doi.org/10.1145/3540250.3549098]

9. Berabi, B.; Gronskiy, A.; Raychev, V.; Sivanrupan, G.; Chibotaru, V.; Vechev, M.T. DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2402.13291] arXiv: 2402.13291

10. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.C.; Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv; 2023; [DOI: https://dx.doi.org/10.1145/3695988] arXiv: 2308.10620

11. Mashhadi, E.; Hemmati, H. Applying CodeBERT for Automated Program Repair of Java Simple Bugs. arXiv; 2021; [DOI: https://dx.doi.org/10.48550/arXiv.2103.11626] arXiv: 2103.11626v2

12. Ren, S.; Guo, D.; Lu, S.; Zhou, L.; Liu, S.; Tang, D.; Zhou, M.; Blanco, A.; Ma, S. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. arXiv; 2020; [DOI: https://dx.doi.org/10.48550/arXiv.2009.10297] arXiv: 2009.10297

13. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T. . Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems; Red Hook, NY, USA, 6–12 December 2020.

14. OpenAI. Hello GPT-4o. 2024; Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 6 December 2024).

15. Meta. Meta-Llama-3-8B-Instruct. 2024; Available online: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (accessed on 6 December 2024).

16. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F. . Mixtral of Experts. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2401.04088] arXiv: 2401.04088

17. Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T. . Code Llama: Open Foundation Models for Code. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2308.12950] arXiv: 2308.12950

18. Fan, J.; Li, Y.; Wang, S.; Nguyen, T.N. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. Proceedings of the 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR); Seoul, Republic of Korea, 29–30 June 2020; pp. 508-512. [DOI: https://dx.doi.org/10.1145/3379597.3387501]

19. Bhandari, G.; Naseer, A.; Moonen, L. CVEfixes: Automated collection of vulnerabilities and their fixes from open-source software. Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering; Athens, Greece, 19–20 August 2021; pp. 30-39. [DOI: https://dx.doi.org/10.1145/3475960.3475985]

20. MITRE. CVE Website. 2025; Available online: https://www.cve.org/ (accessed on 8 October 2025).

21. CWE—Common Weakness Enumeration. Available online: https://cwe.mitre.org/ (accessed on 8 October 2025).

22. Chen, Z.; Kommrusch, S.; Monperrus, M. Neural Transfer Learning for Repairing Security Vulnerabilities in C Code. IEEE Trans. Softw. Eng.; 2023; 49, pp. 147-165. [DOI: https://dx.doi.org/10.1109/TSE.2022.3147265]

23. Pearce, H.; Tan, B.; Ahmad, B.; Karri, R.; Dolan-Gavitt, B. Examining Zero-Shot Vulnerability Repair with Large Language Models. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2112.02125] arXiv: 2112.02125

24. Russell, R.L.; Kim, L.; Hamilton, L.H.; Lazovich, T.; Harer, J.A.; Ozdemir, O.; Ellingwood, P.M.; McConley, M.W. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. arXiv; 2018; [DOI: https://dx.doi.org/10.48550/arXiv.1807.04320] arXiv: 1807.04320

25. Wu, Y.; Jiang, N.; Pham, H.V.; Lutellier, T.; Davis, J.; Tan, L.; Babkin, P.; Shah, S. How Effective Are Neural Networks for Fixing Security Vulnerabilities. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis; Seattle, WA, USA, 17–21 July 2023; pp. 1282-1294. [DOI: https://dx.doi.org/10.1145/3597926.3598135]

26. Islam, N.T.; Khoury, J.; Seong, A.; Karkevandi, M.B.; Parra, G.D.L.T.; Bou-Harb, E.; Najafirad, P. LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2401.03374] arXiv: 2401.03374

27. Joshi, H.; Cambronero, J.; Gulwani, S.; Le, V.; Radicek, I.; Verbruggen, G. Repair Is Nearly Generation: Multilingual Program Repair with LLMs. arXiv; 2022; [DOI: https://dx.doi.org/10.1609/aaai.v37i4.25642] arXiv: 2208.11640

28. de Fitero-Dominguez, D.; Garcia-Lopez, E.; Garcia-Cabot, A.; Martinez-Herraiz, J.J. Enhanced automated code vulnerability repair using large language models. Eng. Appl. Artif. Intell.; 2024; 138, 109291. [DOI: https://dx.doi.org/10.1016/j.engappai.2024.109291]

29. Evtikhiev, M.; Bogomolov, E.; Sokolov, Y.; Bryksin, T. Out of the BLEU: How should we assess quality of the Code Generation models?. J. Syst. Softw.; 2023; 203, 111741. [DOI: https://dx.doi.org/10.1016/j.jss.2023.111741]

30. Fu, M.; Tantithamthavorn, C.; Nguyen, V.; Le, T. ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We?. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2310.09810] arXiv: 2310.09810

31. Ahmed, T.; Ghosh, S.; Bansal, C.; Zimmermann, T.; Zhang, X.; Rajmohan, S. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2301.03797] arXiv: 2301.03797

32. Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A. . LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2406.18403] arXiv: 2406.18403

33. Zhu, L.; Wang, X.; Wang, X. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv; 2025; arXiv: 2310.17631

34. Fu, J.; Ng, S.K.; Jiang, Z.; Liu, P. GPTScore: Evaluate as You Desire. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2302.04166] arXiv: 2302.04166

35. Kim, T.S.; Lee, Y.; Shin, J.; Kim, Y.H.; Kim, J. EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. Proceedings of the CHI Conference on Human Factors in Computing Systems; Honolulu, HI, USA, 11–16 May 2024; Volume 35, pp. 1-21. [DOI: https://dx.doi.org/10.1145/3613904.3642216]

36. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2303.16634] arXiv: 2303.16634

37. Hu, R.; Cheng, Y.; Meng, L.; Xia, J.; Zong, Y.; Shi, X.; Lin, W. Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons. arXiv; 2025; [DOI: https://dx.doi.org/10.1145/3701716.3715265] arXiv: 2502.02988

38. Ye, J.; Wang, Y.; Huang, Y.; Chen, D.; Zhang, Q.; Moniz, N.; Gao, T.; Geyer, W.; Huang, C.; Chen, P.Y. . Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2410.02736] arXiv: 2410.02736

39. Doostmohammadi, E.; Norlund, T.; Kuhlmann, M.; Johansson, R. Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Rogers, A.; Boyd-Graber, J.; Okazaki, N. Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 521-529. [DOI: https://dx.doi.org/10.18653/v1/2023.acl-short.45]

40. Setty, S.; Thakkar, H.; Lee, A.; Chung, E.; Vidra, N. Improving Retrieval for RAG based Question Answering Models on Financial Documents. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2404.07221] arXiv: 2404.07221

41. Yepes, A.J.; You, Y.; Milczek, J.; Laverde, S.; Li, R. Financial Report Chunking for Effective Retrieval Augmented Generation. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2402.05131] arXiv: 2402.05131

42. LangChain. LangChain. 2024; Available online: https://www.langchain.com/ (accessed on 9 October 2024).

43. Blog, N.T. Introduction to LLM Agents. 2023; Available online: https://developer.nvidia.com/blog/introduction-to-llm-agents/ (accessed on 4 December 2024).

44. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K. . Gemini: A Family of Highly Capable Multimodal Models. arXiv; 2024; arXiv: 2312.11805

45. The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (accessed on 12 November 2024).

46. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K. . DeepSeek-Coder: When the Large Language Model Meets Programming—The Rise of Code Intelligence. arXiv; 2024; arXiv: 2401.14196

47. Ding, N.; Chen, Y.; Xu, B.; Qin, Y.; Zheng, Z.; Hu, S.; Liu, Z.; Sun, M.; Zhou, B. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2305.14233] arXiv: 2305.14233

48. xai. Grok OS. Available online: https://x.ai/news/grok-os (accessed on 10 March 2025).

49. Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O. . GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2112.06905] arXiv: 2112.06905

50. Sanseviero, O.; Tunstall, L.; Schmid, P.; Mangrulkar, S.; Belkada, Y.; Cuenca, P. Mixture of Experts Explained. 2023; Available online: https://huggingface.co/blog/moe (accessed on 12 November 2024).

51. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C.H. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. arXiv; 2021; arXiv: 2109.00859

52. Matton, A.; Sherborne, T.; Aumiller, D.; Tommasone, E.; Alizadeh, M.; He, J.; Ma, R.; Voisin, M.; Gilsenan-McMahon, E.; Gallé, M. On Leakage of Code Generation Evaluation Datasets. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N. Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 13215-13223. [DOI: https://dx.doi.org/10.18653/v1/2024.findings-emnlp.772]

53. Jiang, X.; Wu, L.; Sun, S.; Li, J.; Xue, J.; Wang, Y.; Wu, T.; Liu, M. Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study. arXiv; 2025; arXiv: 2412.18260

54. Du, X.; Zheng, G.; Wang, K.; Feng, J.; Deng, W.; Liu, M.; Chen, B.; Peng, X.; Ma, T.; Lou, Y. Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. arXiv; 2024; arXiv: 2406.11147

55. Popovic, M. chrF: Character n-gram F-score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation; Lisboa, Portugal, 17–18 September 2015.

56. Tran, N.; Tran, H.; Nguyen, S.; Nguyen, H.; Nguyen, T. Does BLEU Score Work for Code Migration?. Proceedings of the 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC); Los Alamitos, CA, USA, 25–26 May 2019; pp. 165-176. [DOI: https://dx.doi.org/10.1109/ICPC.2019.00034]

57. GitHub. CodeQL Documentation. 2025; Available online: https://codeql.github.com/docs/ (accessed on 10 March 2025).

58. Snyk. Developer Security. 2025; Available online: https://snyk.io/ (accessed on 10 March 2025).

59. SonarSource. Advanced Security with SonarQube. 2025; Available online: https://www.sonarsource.com/solutions/security/ (accessed on 30 March 2025).

60. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D. . CodeBERT: A Pre-Trained Model for Programming and Natural Languages. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020; Online, 16–20 November 2020; pp. 1536-1547. [DOI: https://dx.doi.org/10.18653/v1/2020.findings-emnlp.139]

61. Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Yin, J.; Jiang, D. . GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv; 2020; arXiv: 2009.08366

62. Xu, F.F.; Alon, U.; Neubig, G.; Hellendoorn, V.J. A systematic evaluation of large language models of code. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS 2022; New York, NY, USA, 13 June 2022; pp. 1-10. [DOI: https://dx.doi.org/10.1145/3520312.3534862]

63. Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. Proceedings of the International Conference on Learning Representations; Online, 25–29 April 2022.

64. Li, Z.; Lu, S.; Guo, D.; Duan, N.; Jannu, S.; Jenks, G.; Majumder, D.; Green, J.; Svyatkovskiy, A.; Fu, S. . Automating code review activities by large-scale pre-training. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022; New York, NY, USA, 14–18 November 2022; pp. 1035-1047. [DOI: https://dx.doi.org/10.1145/3540250.3549081]

65. Peng, J.; Cui, L.; Huang, K.; Yang, J.; Ray, B. CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation. arXiv; 2025; [DOI: https://dx.doi.org/10.48550/arXiv.2501.08200] arXiv: 2501.08200

66. Confident AI. LLM Evaluation Metrics: Everything You Need for LLM Evaluation. 2024; Available online: https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation (accessed on 2 December 2024).

67. Microsoft. Evaluation Metrics Built-In. Available online: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-metrics-built-in?tabs=warning (accessed on 6 March 2025).

68. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y. . A Survey on Evaluation of Large Language Models. arXiv; 2023; [DOI: https://dx.doi.org/10.1145/3641289] arXiv: 2307.03109

69. Friel, R.; Sanyal, A. Chainpoll: A high efficacy method for LLM hallucination detection. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2310.18344] arXiv: 2310.18344

70. Galileo. Completeness. Available online: https://docs.galileo.ai/galileo/gen-ai-studio-products/galileo-guardrail-metrics/completeness (accessed on 6 March 2025).

71. Lemos, R. Patch Imperfect: Software Fixes Failing to Shut Out Attackers. Available online: https://www.darkreading.com/vulnerabilities-threats/patch-imperfect-software-fixes-failing-to-shut-out-attackers (accessed on 6 March 2025).

72. Magazine, I. Google: Incomplete Patches Caused Quarter of Vulnerabilities, in Some Cases. Available online: https://www.infosecurity-magazine.com/news/google-incomplete-patches-quarter/#:~:text=Google (accessed on 6 March 2025).

73. Galileo. Correctness. Available online: https://docs.galileo.ai/galileo/gen-ai-studio-products/galileo-guardrail-metrics/correctness (accessed on 6 March 2025).

74. Qi, Z.; Long, F.; Achour, S.; Rinard, M. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015; New York, NY, USA, 13–17 July 2015; pp. 24-36. [DOI: https://dx.doi.org/10.1145/2771783.2771791]

75. Liu, P.; Liu, J.; Fu, L.; Lu, K.; Xia, Y.; Zhang, X.; Chen, W.; Weng, H.; Ji, S.; Wang, W. Exploring ChatGPT’s capabilities on vulnerability management. Proceedings of the 33rd USENIX Conference on Security Symposium SEC ’24; Philadelphia, PA, USA, 14–16 August 2024.

76. Liu, M.; Wang, J.; Lin, T.; Ma, Q.; Fang, Z.; Wu, Y. An Empirical Study of the Code Generation of Safety-Critical Software Using LLMs. Appl. Sci.; 2024; 14, 1046. [DOI: https://dx.doi.org/10.3390/app14031046]

77. Diener, M.J. Cohen’s d. The Corsini Encyclopedia of Psychology; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2010; 1. [DOI: https://dx.doi.org/10.1002/9780470479216.corpsy0200]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).