INTRODUCTION
A software vulnerability is a flaw that malicious attackers can exploit to compromise a system's confidentiality, integrity, or availability (CIA); buffer overflows, SQL injection, and XSS are common examples. Regulatory authorities, non-profits, and companies have created open standards for sharing vulnerability information, resulting in databases that cybersecurity professionals continually update. Examples of such databases include: (a) Common Vulnerabilities and Exposures (CVE): catalogs software vulnerabilities (Mitre CVE 2005); (b) Common Weakness Enumeration (CWE): describes software security weaknesses; (c) Common Attack Pattern Enumeration and Classification (CAPEC): details known attack and mitigation patterns; and (d) Common Vulnerability Scoring System (CVSS): assesses vulnerability severity (Mell, Scarfone, and Romanosky 2006).
Among these, the CVE database stands out for its detailed descriptions and assessments of vulnerabilities. A CVE description offers a concise overview of the vulnerability, outlining its nature, affected products, potential impact, and conditions for exploitation. CVSS reports CVE severity in two forms: the CVSS Score and the CVSS Vector. The CVSS Score, a composite score on a scale of 0 to 10, offers a quick overview of the vulnerability's impact. It is derived from the CVSS Vector, which comprises several metric groups evaluating the exploitability of the vulnerability: Base (inherent characteristics), Temporal (changes over time), and Environmental (local CIA requirements and controls). The CVE database includes Base and Temporal metrics, while Environmental metrics are calculated by cybersecurity experts for specific products, ensuring accurate severity assessment within their context.
Medical devices are particularly vulnerable to security threats that can compromise patient safety, privacy, medical data integrity, and the availability of diagnostic or treatment devices. According to US FDA guidance, Medical Device Manufacturers (MDMs) must monitor third-party software components for emerging vulnerabilities and assess their impact on the device's safety and security.
Vulnerability management involves three steps: detection, evaluation, and mitigation. Component vendors detect and issue notifications for vulnerabilities, which MDMs must monitor and assess for relevance to their products. Mitigation may involve vendor patches or MDM-implemented controls. Sometimes, CVE vulnerabilities may not pose a risk, requiring only communication to customers.
The notifications generated by component vendors consist of one or more CVEs that impact their component. When these notifications are issued, Siemens Healthineers (SHS) cybersecurity experts conduct manual vulnerability evaluations utilizing CVE descriptions, CVSS scores, and the related knowledge in CWE and CAPEC. In SHS, vulnerability evaluation involves the following assessments based on asset details and the notification: (a) VEXCategory, (b) VEXJustification, (c) InternalComment, (d) CustomerComment, and (e) Vector. Table 1 provides a brief description of these evaluation types.
TABLE 1 Evaluations currently performed manually by SHS cybersecurity experts.
| Evaluations | Description | Value type |
| VEXCategory | Indicates whether the asset is affected by the notification. | Binary categorical values |
| VEXJustification | Provides further explanation if the asset is not affected. | Multiclass categorical values |
| InternalComment | Details the problem and advised internal solution. | Text |
| CustomerComment | Summarizes the notification's impact and solution directed toward the customer. | Text |
| Vector | Evaluates the CVSS Environmental metrics. | Multiclass, multilabel categorical values |
Given SHS's extensive product portfolio (~1.7K products) and numerous third-party components, manually performing these evaluations (~1.7 million/month) can be overwhelming. Automation of vulnerability evaluations serves two purposes: (a) quick and efficient management of the large volume of evaluations required, ensuring consistent quality over time, and (b) mitigation of the risk associated with the potential unavailability of knowledgeable experts, who may no longer be in the same role or company during the entire post-market lifetime of a device, which can span up to 20 years.
From a machine learning perspective, SHS vulnerability evaluation tasks in Table 1 could be categorized into two primary tasks: text generation (InternalComment and CustomerComment) and text classification (VEXCategory, VEXJustification, and Vector). Vector generation is a multi-label, multi-class classification problem, VEXCategory is binary classification, and VEXJustification is multi-class classification.
Recent advancements in transformer models (Devlin et al. 2018) have significantly improved language generation and reasoning capabilities across various natural language processing tasks (Li et al. 2023; Sallam 2023; Wu et al. 2023; Yang et al. 2024). Although encoder models have traditionally been employed for CVE classification in vulnerability evaluation, the progress in generative AI, particularly decoder-based LLMs, has largely been applied only to penetration testing (Deng et al. 2024; Happe and Cito 2023). There is a lack of real-world experiments on the effectiveness of LLMs in vulnerability evaluation. To address the challenge MDMs face in assessing a large number of vulnerabilities, we present an LLM trained on historical evaluation data to assess the impact of vulnerabilities on assets. This approach enables faster detection and assessment of exploitable vulnerabilities, leading to quicker mitigation.
In this work, we developed and deployed an LLM to assist product cybersecurity experts in vulnerability evaluation. Overall, our contributions are:
- 1. This work expands our previous paper on CVE-LLM (Ghosh et al. 2025), which explores using LLMs for vulnerability assessment based solely on asset and vulnerability descriptions in a medical technology industry setting, and includes further experiments and guardrails for model deployment.
- 2. We benchmarked our model against other fine-tuned open-source LLMs and provided insights into best practices for training and deploying a vulnerability LLM.
- 3. We demonstrated how incorporating knowledge from vulnerability knowledge graphs into LLM training enhances performance, particularly in deployment settings.
DATASETS
Our methodology adheres to the pretrain-then-finetune paradigm, as shown in Figure 1. We conduct Domain Adaptive Pretraining (DAPT) using the open-source MPT-7B model (MosaicML 2023), which is followed by Supervised Finetuning (SFT) to develop our model, designated as CVE-LLM. Thus, we construct two distinct datasets: the DAPT dataset for the pretraining phase and the SFT dataset for the subsequent fine-tuning phase.
DAPT dataset
We have formed a DAPT dataset consisting of 350K vulnerability description documents, which combine publicly available CVEs (248K) from the NVD (National Vulnerability Database) and SHS organization-wide vulnerability documents (102K). The organization-wide vulnerability documents describe a combination of one or more vulnerabilities, their effects, and possible mitigations for the affected products. The DAPT dataset includes the CVE title, description, CVSS vector description, affected and unaffected software versions, and mitigation measures (if present).
The CVE descriptions are cleaned by removing URLs and non-UTF-8 characters. We also create a template-based description of the CVSS vectors. For example, for a base vector AV:L, the vector description is "Attack Vector is Local". Details of the vectors and their descriptions can be found in the FIRST specification document.
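As an illustration, the template-based translation can be sketched as below. The metric and value names follow the FIRST CVSS v3.1 specification for the base metrics; the dictionaries cover only a subset, and the function name is an assumption rather than the production code.

```python
# Illustrative subset of CVSS v3.1 base metric names and value names.
METRIC_NAMES = {
    "AV": "Attack Vector", "AC": "Attack Complexity",
    "PR": "Privileges Required", "UI": "User Interaction",
    "S": "Scope", "C": "Confidentiality", "I": "Integrity", "A": "Availability",
}
VALUE_NAMES = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"},
    "AC": {"L": "Low", "H": "High"},
    "PR": {"N": "None", "L": "Low", "H": "High"},
    "UI": {"N": "None", "R": "Required"},
    "S": {"U": "Unchanged", "C": "Changed"},
    "C": {"N": "None", "L": "Low", "H": "High"},
    "I": {"N": "None", "L": "Low", "H": "High"},
    "A": {"N": "None", "L": "Low", "H": "High"},
}

def describe_vector(vector: str) -> str:
    """Translate a CVSS vector string such as 'AV:L/AC:H' into prose."""
    parts = []
    for item in vector.split("/"):
        metric, value = item.split(":")
        if metric in METRIC_NAMES and value in VALUE_NAMES.get(metric, {}):
            parts.append(f"{METRIC_NAMES[metric]} is {VALUE_NAMES[metric][value]}.")
    return " ".join(parts)
```

For instance, `describe_vector("AV:L/AC:H")` yields "Attack Vector is Local. Attack Complexity is High.", matching the example in the text.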
SFT dataset
The SFT dataset is constructed using three organization-wide datasets: (a) Assets, (b) Notifications, and (c) Evaluations. It is further enriched with knowledge from cybersecurity databases and formatted to perform instruction tuning on the DAPT model.
The Assets dataset includes all SHS products with details about the software version, third-party components, and associated sub-organizations. The Notifications dataset is a compilation of vulnerabilities (CVEs) that affect the components, detailing affected components and CVSS base and temporal metrics. The Evaluations dataset contains expert manual assessments of the impact of a notification on an asset, including VEXCategory, VEXJustification, InternalComment, CustomerComment and Vector.
At the time of writing this paper, there are around 1.7K assets, 145K notifications, and 208K evaluations in Assets, Notifications and Evaluations, respectively. There are 152K unique components used by all the assets, 23K unique CVEs, and 11K unique notifications. Most notification descriptions and internal and customer comments consist of no more than 200, 70, and 50 words, respectively.
Enriched instructions through CSKG
Cybersecurity knowledge graphs link disparate but connected cybersecurity resources by means of relationship edges (Iannacone et al. 2015; Syed et al. 2016). Foremost among these is SEPSES, which unifies and links vulnerability databases, including CVE, CWE, and CAPEC, into a cybersecurity knowledge graph (CSKG) (Kiesling et al. 2019). CSKG is continuously updated, so the links between these resources remain current. Utilizing the CSKG, we enrich the Notifications dataset with prerequisites, mitigations, and typical severities of constituent vulnerabilities.
Although the CVE entries describe particular instances of vulnerabilities, their corresponding CWE entries describe general types of weaknesses that can lead to these vulnerabilities. The CAPEC database, on the other hand, provides actionable insights into how those weaknesses are exploited through specific attack techniques and patterns. In the CAPEC database, prerequisites refer to the specific conditions or requirements that must be met for an attack pattern to be successfully executed. Typical severity refers to the level of impact or potential damage that an attack pattern can typically inflict if successfully executed. Mitigations refer to strategies, techniques, or practices that can be employed to prevent or reduce the effectiveness of the attack patterns described.
We utilize the SEPSES knowledge graph to identify the CAPEC entries associated with the CWE entries linked to the CVEs in a notification, and use them to enrich the notification description.
Formation of SFT dataset
The format of the SFT dataset, shown in Table 2, uses the standard Alpaca format (Taori et al. 2023). Each evaluation in the Evaluations dataset forms four entries in the SFT dataset. Table 3 defines the instruction for each evaluation type. An entry in the SFT dataset contains the corresponding instruction, a description of the asset from the Assets dataset, the ontology-enriched description of the notification from the Notifications dataset, the affected software components, CVSS base and temporal metric descriptions, and one of the corresponding assessments: VEXCategory (and VEXJustification), InternalComment, CustomerComment, or CVSS Vector description. CAPEC mitigations are added only for the generation of InternalComment, while prerequisites and typical severity are added for all evaluation types. The SFT dataset finally goes through text cleaning and removal of incomplete evaluations and overlong texts (more than 1024 tokens). Table 4 shows an example of an instruction in the SFT dataset.
TABLE 2 SFT dataset format for a typical instruction.
| SFT dataset format |
| Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. |
| ### Instruction: instruction |
| ### Input: |
| Organization: sub-organization name |
| Software: software name and version |
| Product: asset name and description |
| Notification: cleaned notification description |
| Prerequisites: cleaned description of Prerequisites from CAPEC |
| Typical severity: cleaned description of Typical severity from CAPEC |
| Mitigations: cleaned description of Mitigations from CAPEC |
| Components present in software: description of components common between the asset and the notification |
| Base and Temporal Vectors: description of base and temporal vectors in notification |
| CVSS Version: CVSS version used in notification |
| ### Response: response |
| <STOP > |
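The assembly of one SFT entry in this format can be sketched as follows; the section labels come from Table 2, while the function name, the dict-based input fields, and the exact whitespace are illustrative assumptions.

```python
# Sketch: assemble one SFT entry in the Alpaca-style format of Table 2.
HEADER = ("Below is an instruction that describes a task, paired with an input "
          "that provides further context. Write a response that appropriately "
          "completes the request.")

def build_sft_entry(instruction: str, fields: dict, response: str) -> str:
    """Build one training example: header, instruction, labeled input
    fields (Organization, Software, Product, ...), response, stop token."""
    input_block = "\n".join(f"{label}: {text}" for label, text in fields.items())
    return (f"{HEADER}\n"
            f"### Instruction: {instruction}\n"
            f"### Input:\n"
            f"{input_block}\n"
            f"### Response: {response}\n"
            f"<STOP>")
```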
TABLE 3 Instructions for each evaluation type.
| Evaluation | Instruction | Response |
| VEXCategory / VEXJustification | What is the category? | Justification. Category: Category |
| InternalComment | Generate internal comment. | Internal Comment |
| CustomerComment | Generate customer comment. | Customer Comment |
| Vector | Generate environmental vectors. | Environmental vector description |
TABLE 4 SFT dataset example for generating InternalComment.
| SFT dataset example |
| Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. |
| ### Instruction: Generate internal comments. |
| ### Input: |
| Organization: |
| Software: |
| Product: |
| Notification: Uncontrolled search path in some Intel RST software may allow an authenticated user to potentially enable escalation of privilege via local access. |
| Prerequisites: The attacker must be able to write to redirect search paths on the victim host. |
| Typical severity: Very High. |
| Mitigations: (1) Implementation: Host integrity monitoring. (2) Design: Ensure that the program's compound parts, including all system dependencies, classpath, path, and so on, are secured to the same or higher level assurance as the program. (3) Design: Enforce principle of least privilege. |
| Components present in software: Intel Chipset Device Software - Intel - 10.1.1.44, Intel Graphics Drivers - Intel - 21.20.x, Intel Management Engine Components Installer Driver - Intel - 11.7.0.1043, Intel Network Connections - Intel - 25.0, Intel Trusted Connect Service Client - Intel - 1.47.715.0, Rapid Storage Technology (RST) - Intel - 15.7.x |
| Base and Temporal Vectors: Attack Vector is Local. Attack Complexity is High. Privileges Required is Low. User Interaction is Required. Scope is Unchanged. Confidentiality is High. Integrity is High. Availability is High. Exploit Code Maturity is Unproven. Remediation Level is Official Fix. Report Confidence is Confirmed. |
| CVSS Version: 3.1 |
| ### Response: The vulnerability deployed in the system is controlled. Exploitation of this vulnerability requires privileged local access and high Attack Complexity. Exploitability score (0.8) is below threshold. Device access is protected by username and password. The application is executed in Kiosk mode. PII is encrypted in the database. The database cannot be accessed remotely. Firewall rules are configured. The system is protected by whitelisting. |
| <STOP > |
MODEL TRAINING
DAPT model training
The DAPT dataset is randomly split 90:10 into 315K training and 35K validation data points. We continuously pretrained the MPT-7B base model autoregressively for next-token prediction with a cross-entropy objective. We expanded the vocabulary of the MPT-7B base model to include names of the components and the organization software, adding 539 new tokens to the 50K vocabulary of MPT. The model is trained using DeepSpeed (Rasley et al. 2020) ZeRO-3 optimization with the Lion optimizer (Chen et al. 2024) (learning rate: 1e-4, weight decay: 1e-2) for 3 epochs on 8 Tesla A100 SXM4 GPUs.
SFT model training
The SFT dataset has 750K unique instructions and is randomly split into 600K training, 75K validation, and 44K test data points. The DAPT model is finetuned on the training and validation data by backpropagating on completions only, setting the token labels of the instructions to -100. Except for the learning rate (1e-5), the same optimization settings, hyperparameters, and GPU resources are used for instruction tuning.
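The completion-only loss can be implemented by masking the instruction tokens in the label sequence; a minimal, framework-agnostic sketch, assuming the conventional -100 ignore index of common cross-entropy implementations:

```python
IGNORE_INDEX = -100  # label value ignored by common cross-entropy implementations

def mask_prompt_labels(token_ids: list, prompt_len: int) -> list:
    """Backpropagate on the completion only: labels copy the token ids,
    but every position belonging to the instruction/input prompt is
    replaced with IGNORE_INDEX so it contributes no loss."""
    return [IGNORE_INDEX] * prompt_len + token_ids[prompt_len:]
```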
DEPLOYMENT
The deployment framework includes generation of evaluations with CVE-LLM, predefined rules to catch easily identifiable erroneous generations, guardrails to assess the reliability of LLM outputs, and faster inference frameworks to improve model throughput. The AI-generated vulnerability evaluations are validated by a product cybersecurity expert (with higher priority given to evaluations with an affected VEXCategory). The corrected human evaluations form the input for the subsequent retraining phase.
Generation of evaluations
Evaluation generation follows zero-shot model inference: the model is given a prompt in the pre-defined SFT instruction format, containing the sub-organization name, software name and version, asset name and version, notification description, CAPEC entries, the list of component descriptions for components present in both asset and notification, base and temporal vectors, and CVSS version. The inference/test dataset is divided into two sub-datasets based on the number of tokens in the inference instruction. For instructions longer than 920 tokens, the sequence length of the trained model is increased to 150 plus the token length of the longest instruction; otherwise, the trained model is used unchanged.
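The token-count-based routing can be sketched as follows; the 920-token threshold and the 150-token margin come from the text, while the default sequence length and the function shape are illustrative assumptions.

```python
DEFAULT_SEQ_LEN = 1024  # assumption: the trained model's sequence length
LONG_THRESHOLD = 920    # instructions above this go to the long sub-dataset

def route_batch(instruction_lengths):
    """Split instruction token counts into short/long sub-datasets and pick
    the sequence length for the long one: 150 + longest instruction."""
    short = [n for n in instruction_lengths if n <= LONG_THRESHOLD]
    long_ = [n for n in instruction_lengths if n > LONG_THRESHOLD]
    long_seq_len = 150 + max(long_) if long_ else DEFAULT_SEQ_LEN
    return short, long_, long_seq_len
```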
Rule-based correction
Post-processing of model outputs involves using CSMS-specific knowledge in the form of predefined rules to correct mistakes in model generation. The rules are as follows:
- 1. If the VEXCategory value is NotAffected, the evaluation Vector is not generated.
- 2. If the generated VEXJustification is any text other than the pre-defined categories, 'Other' is assigned to VEXJustification.
- 3. If VEXJustification is 'Other' and the CustomerComment is empty, this indicates a generation error; we reuse the InternalComment as the CustomerComment when it is not empty.
- 4. If VEXCategory is Affected, VEXJustification is automatically set to None, and it is ensured that the Vector has the proper CVSS format.
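A sketch of these rules applied in order to one generated evaluation; the dictionary keys mirror the evaluation names, the allowed-justification set is caller-supplied, and the CVSS-format check of rule 4 is omitted for brevity. This is an illustrative reading of the rules, not the production implementation.

```python
def apply_rules(ev: dict, allowed_justifications: set) -> dict:
    """Post-process one generated evaluation dict with keys VEXCategory,
    VEXJustification, InternalComment, CustomerComment, Vector."""
    ev = dict(ev)  # work on a copy
    # Rule 1: NotAffected evaluations carry no environmental vector.
    if ev.get("VEXCategory") == "NotAffected":
        ev["Vector"] = None
    # Rule 2: out-of-vocabulary justifications collapse to 'Other'.
    if ev.get("VEXJustification") not in allowed_justifications:
        ev["VEXJustification"] = "Other"
    # Rule 3: 'Other' with an empty CustomerComment signals a generation
    # error; fall back to the InternalComment when it is non-empty.
    if ev["VEXJustification"] == "Other" and not ev.get("CustomerComment"):
        if ev.get("InternalComment"):
            ev["CustomerComment"] = ev["InternalComment"]
    # Rule 4: affected assets need no justification (CVSS-format check omitted).
    if ev.get("VEXCategory") == "Affected":
        ev["VEXJustification"] = None
    return ev
```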
Guardrails for assessing LLM outputs
We have observed that LLM outputs can contain hallucinations, especially hallucinated named entities. We adopted the following guardrails to reduce hallucinations and verify the results.
- 1. Ensuring accuracy of named entities in generated text. We formed a Named Entity Recognition (NER) corpus of 1392 entities, with entity types OS (Operating System), vendor, product, version, vulnerability ID, IP, and port. SecBERT was finetuned on these entities using a token-level cross-entropy loss applied to BIO-tagged labels, with a learning rate of 2e-5, a batch size of 32, and 5 epochs of training using the AdamW optimizer with linear learning-rate decay and warmup. The model has an F1 of 0.81; adding a spaCy (Honnibal and Montani 2017) rule-based NER brings the F1 up to 0.9. The NER model detects all named entities in the text generated for InternalComment and CustomerComment, and we verify their correctness against product documents and the incoming notification description. If the generated text gets all entities correct except the version numbers, we edit in the correct version numbers. However, if the generated text misses core entities, for example, correct ports, OS, or IP, we refrain from sending the output and store it in our database as incorrectly generated output.
- 2. Constrained output decoding. We use constrained output decoding to ensure that the generated text for VEXCategory, VEXJustification, and Vector is bounded by the pre-defined values for these outputs.
- 3. Self-reliability check. We use OpenAI GPT-4o (Hurst et al. 2024) to reword the notification description, asynchronously and with batching, generating five reworded descriptions for each notification; our trained NER verifies that each reworded description contains all the named entities of the original. We then use CVE-LLM to generate VEXCategory, VEXJustification, and Vector for each of the six notification versions (original plus reworded), combined with the rest of the instruction components. If the agreement among the results is higher than 80%, we integrate them into the CSMS system; otherwise, we store them as incorrectly generated output.
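The agreement test in the self-reliability check can be sketched as a majority vote over the six generations. This is an illustrative reading of the 80% threshold; the exact agreement measure used in production may differ.

```python
from collections import Counter

def reliability_check(generations: list, threshold: float = 0.8) -> bool:
    """Self-reliability guardrail sketch: given CVE-LLM outputs for the
    original notification and its reworded variants, accept only if the
    most frequent (VEXCategory, VEXJustification, Vector) triple reaches
    the agreement threshold."""
    triples = [(g["VEXCategory"], g["VEXJustification"], g["Vector"])
               for g in generations]
    _, top_count = Counter(triples).most_common(1)[0]
    return top_count / len(triples) >= threshold
```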
Model result integration into the CSMS system
The model results are generated daily via batch processing using the vLLM architecture (Kwon et al. 2023) on one Tesla A100 SXM4 GPU, using newly arrived notifications from third-party component vendors and the software versions of SHS products. AI evaluations are generated for every applicable product that uses the third-party components relevant to the notifications, as well as for every past notification that is applicable to the product. These evaluations are stored in a database and made available to product cybersecurity experts, who can auto-populate the CSMS with AI-recommended evaluations and/or revise them as needed. The corrected evaluations are subsequently archived for further training, in addition to supporting mitigation investigation and determination of the appropriate reply to customers.
RESULTS
The experimental results are reported on two datasets: (a) the test dataset with 44K evaluations, and (b) the post-deployment dataset (CVE-LLM-Prod) with 10K expert evaluations.
We use ROUGE-L and micro-F1 for evaluating the model responses: ROUGE-L (Lin and Och 2004) is used for evaluating the model-generated InternalComment and CustomerComment, whereas micro-F1 (Pedregosa et al. 2011) is used for VEXCategory, VEXJustification and (environmental) Vector.
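For the classification outputs, micro-F1 pools true positives, false positives, and false negatives over all samples and classes before computing F1, which also handles the multi-label Vector task. A minimal sketch of the metric (the set-based interface is illustrative):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for (multi-label) predictions: pool true positives,
    false positives, and false negatives over all samples, then compute F1.
    y_true and y_pred are parallel lists of label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
```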
Benchmarking with open source LLMs
We have used open source models that are of similar size as CVE-LLM and are top-performing models on LLM Leaderboards (Huggingface Leaderboard 2023). We trained them using the DAPT and SFT methods described in the Model Training section. In addition, we also performed RAG over our test dataset with Llama3-70B. We indexed the security documentation as well as historical evaluations for each product. We retrieved the relevant information separately from these two types of documents, and prompted the LLM, combining the retrieved information and the vulnerability evaluation questions. Table 5 enumerates the benchmarking results on our test dataset.
TABLE 5 Benchmarking CVE-LLM against state-of-the-art open-source LLMs. RAG = Retrieval-Augmented Generation using the Llama3 model, CVE-LLM = CVE-LLM results on the test dataset (N=44K), CVE-LLM-Eval = deployed CVE-LLM results on the test dataset (N=44K), Prod = post-deployment results of CVE-LLM in production (N=10K).
| Evaluation | RAG | Mistral-7B | Llama2-7B | CVE-LLM | CVE-LLM-Eval | Prod |
| VEXCategory | 0.25 | 0.64 | 0.56 | 0.94 | 0.94 | 0.86 |
| VEXJustification | 0.12 | 0.60 | 0.53 | 0.90 | 0.92 | 0.95 |
| InternalComment | 0.19 | 0.71 | 0.62 | 0.79 | 0.79 | 0.73 |
| CustomerComment | 0.28 | 0.70 | 0.64 | 0.88 | 0.89 | 0.79 |
| Vector | 0.13 | 0.59 | 0.57 | 0.96 | 0.98 | 0.98 |
CVE-LLM, which is based on the MPT-7B model, shows better performance than the Llama2-7B- and Mistral-7B-based models. The highest performance improvement is observed for the classification-based generations: VEXCategory, VEXJustification, and Vector. Post-processing the model results shows the highest improvement for VEXJustification and Vector. We also observe that the performance of CVE-LLM post-deployment (i.e., with completely new assets and notifications) is comparable to the test dataset results.
Benchmarking with encoder-based classification models
We trained SecBERT and CySecBERT (Bayer et al. 2024) on SFT dataset examples for VEXCategory, VEXJustification, and Vector. Due to token constraints, we used 70K training examples for each output type, maintaining diversity and equal representation of all divisions, third-party vendors, and components. SecBERT and CySecBERT were trained with binary cross-entropy loss for VEXCategory; the VEXJustification model is a multiclass classifier, while the Vector model is a multitask classifier. All models were trained with a learning rate of 2e-5 and the AdamW optimizer. Table 6 shows the results of the encoder models, compared with CVE-LLM.
TABLE 6 Test dataset Results: Results comparing encoder model results for classification tasks.
| Evaluation | CVE- LLM | CySec- BERT | Sec- BERT |
| VEXCategory | 0.94 | 0.92 | 0.93 |
| VEXJustification | 0.90 | 0.80 | 0.80 |
| Vector | 0.96 | 0.88 | 0.87 |
Ablation studies
We conducted ablation studies with respect to different components of model training and inference: (a) DAPT-SFT versus SFT-only system, (b) impact of beam size, temperature, and nucleus sampling during model inference, (c) sequence length during training, and (d) dataset size. For each of these experiments, the test dataset consists of the same data points from the Evaluations dataset. The training data points vary across the experiments, but we ensured that all possible assets are represented in the training datasets.
- (a) Effect of Domain Adaptation. The results of supervised finetuning of the domain-adapted CVE-LLM-Base versus supervised finetuning of plain MPT-7B are shown in Table 7. Domain adaptation and vocabulary expansion lead to much better performance for all instruction types.
- (b) Effect of Inference Parameters. In generative model inference, parameters such as temperature, beam size, and nucleus sampling significantly influence the quality and diversity of the generated outputs. Temperature controls the randomness of the output, with higher values leading to more diverse results. Nucleus sampling (top-p sampling) dynamically adjusts the candidate token set based on the cumulative probability threshold p, balancing diversity against coherence. Beam size in beam search influences the depth of the search for probable sequences, balancing computational cost against output quality. Generally, these parameters are adjusted to optimize the trade-off between diversity and coherence of the generated text. We performed inference on our test dataset using the CVE-LLM-Eval model to measure the effect of these parameters.
- Temperature and Nucleus Sampling. To test the effect of temperature, we kept the other generation parameters at their default settings and varied the temperature from 0 to 1 in increments of 0.1. Similarly, for top-p sampling, we varied "top-p" from 0 to 1 in increments of 0.1, keeping all other generation parameters at their default settings. In both experiments, no changes in the outputs were observed.
- Beam Size. We measured the effect of beam size for each instruction type by changing the beam size from 1 to 19 in increments of 2, maintaining the other generation parameters at their default values. We observe that beyond a beam size of 7, the performance of the system steadily declines for all instruction types. The effect of beam size on model generation is shown in Figure 2F.
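The nucleus-sampling candidate set described above can be sketched as follows; an illustrative pure-Python reading, whereas real decoders operate on sorted logit tensors.

```python
def nucleus_candidates(probs, top_p):
    """Nucleus (top-p) sampling sketch: return the indices of the smallest
    set of tokens, taken in descending probability order, whose cumulative
    probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept
```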
- (c) Effect of Sequence Length in Training. In this experiment, we changed the model's configured sequence length to 512, 750, and 1024, and reduced the training dataset to only the instructions shorter than the sequence length. Figure 3 shows the effect of training sequence length on the model outcome for all instruction types. We find that although the dataset size does not change much, there is more variation in performance: as the training sequence length increases, performance on datasets with longer instructions also improves.
- (d) Model Performance Scaling with Training Dataset Size. We experimented with subsets of the training dataset, varying the size of the instruction-tuning dataset from 150K to 440K instructions. The subsets are formed such that all asset types and all notification types are always represented. The smallest dataset that affords this invariance is the 150K training dataset; the larger datasets add randomly selected data points from the remainder. Figure 4 shows the rate of performance improvement with training dataset size.
TABLE 7 Test dataset Ablation Results: Results without DAPT and without ontology.
| Evaluation | CVE- LLM | No DAPT | No ontology |
| VEXCategory | 0.94 | 0.83 | 0.93 |
| VEXJustification | 0.90 | 0.80 | 0.89 |
| InternalComment | 0.79 | 0.64 | 0.80 |
| CustomerComment | 0.88 | 0.69 | 0.88 |
| Vector | 0.96 | 0.84 | 0.95 |
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
We notice that CustomerComment shows the least variation in ROUGE-L score, while VEXCategory shows the highest micro-F1 variance. In addition, the performance improvement stagnates for most instructions as we reach the 440K-instruction dataset size.
Effect of ontology-assisted enrichment
We have tested the effect of ontology enrichment on our test dataset (Table 7) as well as deployment data (Figure 2) with 10K evaluations. The model has been trained with data until April and has been deployed since May. The effect in deployment shows the efficacy of the ontology-assisted enrichment.
Effect of post-processing
Both rule-based corrections and guardrails help in boosting performance in production. Rule-based methods have contributed to correction of 1.34% of the results, whereas guardrails have been able to flag 2.12% of the results as incorrect. Guardrail-based correction led to correcting 0.87% of the outputs in the deployment dataset.
Inference time
We used different techniques to optimize inference time. First, we adapted the popular model-serving framework vLLM for MPT-7B and measured the inference time on one Tesla A100 SXM4 GPU for each instruction type, finding a speedup of roughly 10x. The results are shown in Table 8. In addition, adaptive generation, dividing the inference batch into batches of small and large token lengths, improves the average time over the test dataset from 8 s to 4.5 s per evaluation. Using a different maximum sequence length for each instruction type further reduces the inference time per data point in the Evaluations dataset by 1.5 times. We also measured the time required by a cybersecurity expert in our organization to evaluate vulnerabilities: averaged over 7K evaluations, the mean is 194 s with a median of 58 s, whereas our model takes only about 1 s per evaluation.
TABLE 8 Inference time in seconds on our test dataset for model serving with sequence length = 20K.
| Evaluation | No speedup | vLLM |
| VEXCategory | 0.5 | 0.04 |
| Internal Comment | 2.5 | 0.20 |
| Customer Comment | 2 | 0.19 |
| Vector | 3 | 0.32 |
DISCUSSION
In this section, we discuss our observations on the training and performance of LLMs for vulnerability evaluation, potential areas of improvement, and future work.
Ontology enrichment
Although ontology enrichment did not yield a substantial performance improvement on the test dataset, it demonstrated a significant impact during deployment on unseen notifications and assets. Most of the errors on unseen notifications stem from the absence of a linking factor between a new notification and an older, similar notification. A close examination shows that for most notifications with similar prerequisites, the responses to the instructions are similar across assets. In the future, we would like to experiment with adding more knowledge about the notifications from other security knowledge graphs.
Domain adaptation
We observed a significant improvement in performance with domain adaptation of the LLM. Error analysis for the model without domain adaptation shows more hallucinations, more generic comment generation, and lower performance on VEXJustification and on Vectors that are less represented.
Performance across instructions
The performance of CVE-LLM on classification tasks is generally better than on generation tasks. Although the F1 score for Vector classification is generally high, it remains low for underrepresented classes, resulting in a decline in overall performance. Similarly, performance drops from VEXCategory to VEXJustification classification, largely due to underrepresented classes in VEXJustification.
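To make concrete how a few underrepresented classes can drag down an otherwise high aggregate score, the sketch below computes per-class and macro F1 from scratch on toy labels; the class names and counts are hypothetical, not our dataset.

```python
# Toy per-class vs. macro F1: class "H" has only 2 of 100 samples and
# is always misclassified, so macro F1 drops sharply even though the
# frequent classes score near 1.0. Labels and counts are hypothetical.
def per_class_f1(y_true, y_pred, labels):
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores, sum(scores.values()) / len(labels)

y_true = ["N"] * 90 + ["L"] * 8 + ["H"] * 2
y_pred = ["N"] * 90 + ["L"] * 8 + ["N"] * 2  # every "H" misclassified
scores, macro = per_class_f1(y_true, y_pred, ["N", "L", "H"])
# "N" and "L" score near 1.0, yet macro F1 falls to about 0.66
```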
We surmise that the performance of CVE-LLM for CustomerComment is better than that in InternalComment due to the following factors:
- We observed that approximately 85% of the customer comments follow a highly templated format and are repeated across multiple assets and multiple subdivisions for similar notifications, which is not the case for internal comments.
- Internal comments tend to mention specific products, components, and versions, which is generally not seen in customer comments. CVE-LLM tends to hallucinate on named entities.
Generation parameters
We have noticed variations in model output with beam size, with performance declining at higher beam sizes. Increasing the beam size increases the diversity of the generated text, which leads to less contextually relevant outputs, as seen in our experiments. A low beam size is sufficient for high-quality generation because of the specialized nature of the dataset: given the context, the probability of less probable outcomes is far below that of the highly probable ones. This is also why temperature- and nucleus-sampling-based settings produce no variation in model-generated outputs.
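A minimal beam search over fixed, peaked token distributions illustrates why a small beam suffices here: when probability mass concentrates on a few continuations, a larger beam selects the same sequence as a smaller one. The distributions and tokens below are toy assumptions, not model outputs.

```python
import math

def beam_search(step_dists, beam_size):
    """Minimal beam search over fixed per-step token distributions.

    step_dists[t] maps token -> probability at step t; this toy version
    ignores context entirely and exists only to make beam_size concrete.
    """
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_dists:
        candidates = [
            (seq + (tok,), lp + math.log(p))
            for seq, lp in beams
            for tok, p in dist.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

# Peaked distributions, as in a specialized dataset: a larger beam
# finds nothing better, so a small beam size suffices.
peaked = [{"update": 0.9, "patch": 0.1}, {"to": 0.95, "with": 0.05}]
assert beam_search(peaked, 1) == beam_search(peaked, 4) == ("update", "to")
```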
Dataset size
With the increase in dataset size, we have seen a steady rise in performance for most evaluation types. VEXJustification improves at a lower rate than the related evaluation type VEXCategory. CustomerComment shows the smallest boost over time, and VEXCategory improves the most. This is mostly because CustomerComment shows little variation across the same product type for similar kinds of CVEs. VEXCategory, on the other hand, is highly dependent on the CVEs contained in the notification, and the diversity of the dataset helps the model understand the notifications better. Similarly, the same products might share the same VEXJustification. Evaluation types that are more product-dependent do not benefit much from an increase in dataset size.
Error analysis across LLMs
We used MPT-7B to train CVE-LLM for two reasons: (a) its sequence length can be extended to at least 20K tokens without significant loss of performance, and (b) it performed better than the other open-source models we tested on our dataset. However, among all the LLMs we trained, the majority of the issues we encountered are remarkably similar, differing primarily in severity. The errors we encountered across LLMs fall into four types:
- Hallucination in named entities: The generated text for both InternalComment and CustomerComment has been observed to omit, or falsely include, several critical details, particularly the affected software versions, the recommended update versions, the names of the software or components, and other similar named entities that constitute essential information. We plan to introduce custom losses during model training in the future to counteract these hallucinations.
- Spurious text generation: This error is more common for the Llama2 model and was observed in MPT-7B before the introduction of a dedicated STOP token.
- Performance degradation on long text: Although the trained model's sequence length can be extended beyond the training sequence length, performance on long sequences still falls behind. Long instructions comprise multiple vulnerabilities and constitute only 10% of our test dataset. In the future, we plan to segment such instructions into individual vulnerabilities and use a more robust semantic understanding of each vulnerability to perform vulnerability chaining. We also tested the system only with zero-shot instructions at inference time; we surmise that few-shot methods could better leverage the model's language understanding when assessing vulnerabilities for assets.
- Errors due to differing patterns of task responses across organization subdivisions: Variation in task responses across subdivisions affects generation quality, even though the instruction input context includes the subdivision name. This is an issue especially when a notification is far better represented in another subdivision. In the future, we will attempt to address this with data augmentation and sampling techniques.
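As a first approximation, the segmentation of long, multi-vulnerability instructions mentioned above could split a notification's text at each CVE identifier. The function below is a hypothetical sketch under that assumption, not our implementation; real notifications would need more careful parsing.

```python
import re

# Standard CVE identifier pattern: CVE-YYYY-NNNN (4 or more digits).
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,7}")

def segment_by_cve(notification: str) -> dict[str, str]:
    """Split a multi-CVE notification into one text chunk per CVE id.

    Each chunk runs from one identifier to the start of the next; a
    repeated identifier overwrites its earlier chunk in this sketch.
    """
    matches = list(CVE_ID.finditer(notification))
    chunks = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(notification)
        chunks[m.group()] = notification[m.start():end].strip()
    return chunks
```

Each resulting chunk could then be evaluated as its own short instruction, with vulnerability chaining applied afterwards.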
RELEVANT WORK
Language models have been used extensively in vulnerability management (a) to determine CVSS metrics from CVE descriptions, (b) to map vulnerabilities in the CVE database to the corresponding attack tactics and techniques in the ATT&CK database7, (c) for vulnerability detection, and (d) for vulnerability repair. Determining CVSS metrics from vulnerability descriptions has been treated as a text classification or regression problem. Mapping to the CVSS score has been done with linear regression over a Bag-of-Words model (Elbaz, Rilling, and Morin 2020) or with a neural network over Doc2Vec document features (Vasireddy et al. 2023). CVSS-BERT (Shahid and Debar 2021), on the other hand, trained separate BERT models to classify CVE descriptions into the values of the different CVSS vectors. Both encoder- and decoder-based models have been used for mapping vulnerabilities to ATT&CK tactics and techniques. CVET (Ampel et al. 2021), a RoBERTa (Liu et al. 2019)-based model, classified CVE descriptions into one of ten ATT&CK tactics, whereas SMET (Abdeen et al. 2023) used BERT-based textual similarity to map CVE entries to ATT&CK techniques. Although ChatGPT-based approaches did not yield state-of-the-art results (Liu et al. 2023), VTT-LLM (Zhang et al. 2024), trained on various versions of the decoder-based Bloom model (Le Scao et al. 2023) with Chain-of-Thought (Wei et al. 2022) instructions, incorporated relations between core concepts in the CWE (Christey et al. 2013) and CAPEC databases8 for CVE-to-ATT&CK mapping and surpassed encoder-based models. Other work (Yosifova, Tasheva, and Trifonov 2021) classified CVE vulnerability types using TF-IDF features and standard machine learning classifiers.
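As a rough illustration of the Bag-of-Words regression line of work (in the spirit of Elbaz, Rilling, and Morin 2020), the sketch below fits a linear model from CVE-style descriptions to CVSS scores. The descriptions, scores, and pipeline are toy assumptions for demonstration only, not the cited system.

```python
# Toy Bag-of-Words -> CVSS score regression: vectorize descriptions
# into word counts and fit a linear model. Data below is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "remote code execution via buffer overflow",
    "sql injection allows remote attackers to read database contents",
    "cross site scripting in login form",
    "local denial of service via crafted input",
]
cvss_scores = [9.8, 8.6, 6.1, 5.5]  # toy CVSS base scores

model = make_pipeline(CountVectorizer(), LinearRegression())
model.fit(descriptions, cvss_scores)

# The fitted model maps an unseen description to a score estimate.
estimate = model.predict(["remote buffer overflow in service"])[0]
```

Classification variants of this idea replace the regressor with a classifier over discrete metric values, as in the CVSS-vector work cited above.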
For vulnerability detection, encoder-based LLMs (Ameri et al. 2021; Yin et al. 2020) have been instrumental, particularly through the pretrain-and-finetune paradigm. These approaches have also leveraged novel pretraining strategies, the integration of Graph Neural Networks (Sewak, Emani, and Naresh 2023) or Long Short-Term Memory (LSTM) networks (Hassanin et al. 2024), prompt tuning, program analysis, specialized calculus-based causal reasoning, and knowledge graph-based reasoning. More recently, the focus has shifted toward decoder-only LLMs (Zhou et al. 2024) for vulnerability detection. Our work uses the pretrain-and-finetune approach with decoder-based LLMs, coupled with ontology enrichment, for vulnerability evaluation generation.
CONCLUSION
This study illustrates the capacity of large language models (LLMs) to learn from expert-curated historical vulnerability evaluation data, thereby enabling the automation of vulnerability evaluation generation for MDMs. Our model (CVE-LLM) has proven to be effective in learning from past evaluations and has shown high accuracy in prediction of VEXCategory and CVSS Vectors. The model's inference is also substantially faster, achieving speeds approximately 50–100 times greater than those of a human expert. The system is deployed in a human-in-the-loop manner, assisting product cybersecurity experts in swiftly identifying vulnerabilities that affect the assets and communicating mitigations promptly to the customer.
In future work, we intend to investigate additional knowledge infusion techniques, incorporating a broader range of cybersecurity databases. Addressing the challenge of hallucinations remains a key focus. Furthermore, we aim to develop mechanisms for integrating more comprehensive product knowledge by leveraging source code and official cybersecurity documentation. Beyond standardized databases, numerous blogs and websites provide continuous updates on vulnerabilities, which we intend to incorporate into our DAPT and SFT training paradigms. The success of CVE-LLM in the SHS vulnerability evaluation platform has also highlighted opportunities to assist cybersecurity experts in vulnerability mitigation.
DISCLAIMER
The concepts and information presented in this paper/presentation are based on research results that are not commercially available. Future commercial availability cannot be guaranteed.
CONFLICT OF INTEREST STATEMENT
Rikhiya Ghosh and Sanjeev Kumar Karn are Senior NLP Scientist and Senior Key Expert, respectively, in Siemens Healthineers Digital Technology and Innovation Center in Princeton, New Jersey, USA. Hans-Martin von Stockhausen is Principal Key Cybersecurity Expert in Siemens Healthineers. Martin Schmitt is Threat and Vulnerability Expert in Siemens Healthineers Cybersecurity. George Marica Vasille is an AI Engineer at Siemens Corporate Technology Romania, and Oladimeji Farri is a Senior Vice President at Get Well.
Abdeen, B., E. Al‐Shaer, A. Singhal, L. Khan, and K. Hamlen. 2023. “Smet: Semantic Mapping of Cve to Att&ck and Its Application to Cybersecurity.” In IFIP Annual Conference on Data and Applications Security and Privacy, 243–260. Springer.
Ameri, K., M. Hempel, H. Sharif, J. Lopez Jr, and K. Perumalla. 2021. “Cybert: Cybersecurity Claim Classification by Fine‐Tuning the Bert Language Model.” Journal of Cybersecurity and Privacy 1(4): 615–637.
Ampel, B., S. Samtani, S. Ullman, and H. Chen. 2021. “Linking Common Vulnerabilities and Exposures to the Mitre Att&ck Framework: A Self‐Distillation Approach.” arXiv preprint arXiv:2108.01696.
Bayer, M., P. Kuehn, R. Shanehsaz, and C. Reuter. 2024. “Cysecbert: A Domain‐Adapted Language Model for the Cybersecurity Domain.” ACM Transactions on Privacy and Security 27(2): 1–20.
Chen, X., C. Liang, D. Huang, E. Real, K. Wang, H. Pham, X. Dong, T. Luong, C. J. Hsieh, Y. Lu, et al. 2024. “Symbolic discovery of optimization algorithms.” Advances in Neural Information Processing Systems 36.
Christey, S., J. Kenderdine, J. Mazella, and B. Miles. 2013. Common weakness enumeration. Mitre Corporation.
Deng, G., Y. Liu, V. Mayoral‐Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzger, and S. Rass. 2024. “PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing.” In 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association.
Devlin, J., M.‐W. Chang, K. Lee, and K. Toutanova. 2018. “Bert: Pre‐training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.
Elbaz, C., L. Rilling, and C. Morin. 2020. “Fighting N‐Day Vulnerabilities With Automated CVSS Vector Prediction at Disclosure.” In Proceedings of the 15th International Conference on Availability, Reliability and Security, 1–10. Association for Computing Machinery.
Ghosh, R., H. M. von Stockhausen, M. Schmitt, G. M. Vasile, S. K. Karn, and O. Farri. 2025. “CVE‐LLM: Ontology‐assisted Automatic Vulnerability Evaluation Using Large Language Models.” In Proceedings of the AAAI Conference on Artificial Intelligence vol. 39, no. 28, 28757–28765. AAAI.
Happe, A., and J. Cito. 2023. “Getting pwn'd by Ai: Penetration Testing With Large Language Models.” In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2082–2086. Association for Computing Machinery.
Hassanin, M., M. Keshk, S. Salim, M. Alsubaie, and D. Sharma. 2024. “PLLM‐CS: Pre‐Trained Large Language Model (LLM) for Cyber Threat Detection in Satellite Networks.” arXiv preprint arXiv:2405.05469.
Honnibal, M., and I. Montani. 2017. spaCy 2: Natural Language Understanding With Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.
Huggingface. 2023. Open LLM Leaderboard.
Hurst, A., A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. J. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. 2024. “Gpt‐4o system card.” arXiv preprint arXiv:2410.21276.
Iannacone, M., S. Bohn, G. Nakamura, J. Gerth, K. Huffer, R. Bridges, E. Ferragut, and J. Goodall. 2015. “Developing an Ontology for Cyber Security Knowledge Graphs.” In Proceedings of the 10th annual cyber and information security research conference, 1–4.
Kiesling, E., A. Ekelhart, K. Kurniawan, and F. Ekaputra, 2019. “The SEPSES Knowledge Graph: An Integrated Resource for Cybersecurity.” In International Semantic Web Conference, 198–214. Springer.
Kwon, W., Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. 2023. “Efficient Memory Management for Large Language Model Serving With Paged Attention.” In Proceedings of the 29th Symposium on Operating Systems Principles, 611–626.
Le Scao, T., A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. 2023. Bloom: A 176b‐parameter open‐access multilingual language model.
Li, Y., S. Wang, H. Ding, and H. Chen. 2023. “Large Language Models in Finance: A Survey.” In Proceedings of the Fourth ACM International Conference on AI in Finance, 374–382. Association for Computing Machinery.
Lin, C.‐Y., and F. Och. 2004. “Looking for a Few Good Metrics: ROUGE and Its Evaluation.” In NTCIR workshop.
Liu, X., Y. Tan, Z. Xiao, J. Zhuge, and R. Zhou. 2023. “Not the End of Story: An Evaluation of Chatgpt‐Driven Vulnerability Description Mappings.” In Findings of the Association for Computational Linguistics: ACL 2023, 3724–3731.
Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. “Roberta: A Robustly Optimized Bert Pretraining Approach.” arXiv preprint arXiv:1907.11692.
Mell, P., K. Scarfone, and S. Romanosky. 2006. “Common Vulnerability Scoring System.” IEEE Security & Privacy 4(6): 85–89.
Mitre CVE. 2005. Common Vulnerabilities and Exposures.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. 2011. “Scikit‐Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–2830.
Rasley, J., S. Rajbhandari, O. Ruwase, and Y. He. 2020. “Deepspeed: System Optimizations Enable Training Deep Learning Models With Over 100 Billion Parameters.” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3505–3506. Association for Computing Machinery.
Sallam, M. 2023. “ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns.” In Healthcare, vol. 11, 887. MDPI.
Sewak, M., V. Emani, and A. Naresh. 2023. CRUSH: Cybersecurity Research using Universal LLMs and Semantic Hypernetworks.
Shahid, M. R., and H. Debar. 2021. “CVSS‐BERT: Explainable Natural Language Processing to Determine the Severity of a Computer Security Vulnerability From Its Description.” In 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 1600–1607. IEEE.
Syed, Z., A. Padia, T. Finin, L. Mathews, and A. Joshi. 2016. “UCO: A Unified Cybersecurity Ontology.” In Workshops at the thirtieth AAAI conference on artificial intelligence.
Taori, R., I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. 2023. “Alpaca: A Strong, Replicable Instruction‐Following Model.” Stanford Center for Research on Foundation Models 3(6): 7.
MosaicML NLP Team 2023. Introducing MPT‐7B: A New Standard for Open‐Source, Commercially Usable LLMs. Accessed: 2023‐05‐05.
Vasireddy, D. T., D. S. Dale, and Q. Li. 2023. “CVSS Base Score Prediction Using an Optimized Machine Learning Scheme.” In 2023 Resilience Week (RWS), 1–6. IEEE.
Wei, J., X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. 2022. “Chain‐of‐Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35: 24824–24837.
Wu, Q., G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. 2023. “Autogen: Enabling Next‐Gen llm Applications via Multi‐Agent Conversation Framework.” arXiv preprint arXiv:2308.08155.
Yang, X., Z. Wang, Q. Wang, K. Wei, K. Zhang, and J. Shi. 2024. “Large language models for automated q&a involving legal documents: A survey on algorithms, frameworks and applications.” International Journal of Web Information Systems 20(4): 413–435.
Yin, J., M. Tang, J. Cao, and H. Wang. 2020. “Apply Transfer Learning to Cybersecurity: Predicting Exploitability of Vulnerabilities by Description.” Knowledge‐Based Systems 210: 106529.
Yosifova, V., A. Tasheva, and R. Trifonov. 2021. “Predicting Vulnerability Type in Common Vulnerabilities and Exposures (cve) Database With Machine Learning Classifiers.” In 2021 12th National Conference with International Participation (ELECTRONICA), 1–6. IEEE.
Zhang, C., L. Wang, D. Fan, J. Zhu, T. Zhou, L. Zeng, and Z. Li. 2024. “VTT‐LLM: Advancing Vulnerability‐to‐Tactic‐and‐Technique Mapping Through Fine‐Tuning of Large Language Model.” Mathematics 12(9): 1286.
Zhou, X., S. Cao, X. Sun, and D. Lo. 2024. “Large Language Model for Vulnerability Detection and Repair: Literature Review and Roadmap.” arXiv preprint arXiv:2404.02525.
© 2025. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The National Vulnerability Database (NVD) publishes over a thousand new vulnerabilities monthly, with a projected 25 percent increase in 2024, highlighting the crucial need for rapid vulnerability identification to mitigate cybersecurity attacks and save costs and resources. In this work, we propose using large language models (LLMs) to learn vulnerability evaluation from historical assessments of medical device vulnerabilities in a single manufacturer's portfolio. We highlight the effectiveness and challenges of using LLMs for automatic vulnerability evaluation and introduce a method to enrich historical data with cybersecurity ontologies, enabling the system to understand new vulnerabilities without retraining the LLM. Our LLM system integrates with the in‐house application—Cybersecurity Management System (CSMS)—to help Siemens Healthineers (SHS) product cybersecurity experts efficiently assess the vulnerabilities in our products. Also, we present a comprehensive set of experiments that helps showcase the properties of the LLM and dataset, the various guardrails we have implemented to safeguard the system in production, and the guidelines for efficient integration of LLMs into the cybersecurity tool.