Abstract
Precisely evaluating text similarity remains a fundamental challenge in Natural Language Processing (NLP), with widespread applications in plagiarism detection, information retrieval, semantic analysis, and recommendation systems. Traditional approaches often suffer from overfitting, stagnation in local optima, and difficulty capturing deep semantic relationships. To address these challenges, this paper introduces an Intelligent Text Similarity Assessment Model that integrates Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa) with Chaotic Sand Cat Swarm Optimization (CHSCSO), a novel swarm intelligence-based optimization method inspired by chaotic dynamics. The model leverages RoBERTa's robust contextual embeddings to extract deep semantic representations while utilizing CHSCSO's controlled chaotic perturbations to optimize hyperparameters dynamically. This integration enhances model generalization, mitigates overfitting, and improves the trade-off between exploration and exploitation during training. CHSCSO refines the parameter search space by employing chaotic maps, ensuring a more adaptive and efficient training process. Extensive experiments on multiple benchmark datasets, including Semantic Textual Similarity (STS) and Textual Entailment (TE), demonstrate the model's superiority over standard RoBERTa fine-tuning and conventional baselines, with the optimized model's cosine similarity scores clustered near 0.996. The optimized model achieves higher accuracy and improved stability and exhibits faster convergence in text similarity tasks.
Introduction
Semantic Textual Similarity (STS) is a fundamental task in NLP aimed at quantifying the semantic closeness between two pieces of text. It is crucial in various applications, including information retrieval, question answering, and summarization. In recent years, deep learning, particularly transformer-based models, has significantly improved the accuracy of STS systems. The proposed model uses RoBERTa's robust contextual embeddings to capture deep text semantics and CHSCSO's controlled chaotic perturbations to optimize parameters. This integration helps the model escape local optima, generalize better, and remain robust to semantic variations. CHSCSO dynamically modifies the search space using chaotic maps to balance exploration and exploitation and to find optimal parameter configurations. Addressing these constraints is crucial to improving STS models in real-world applications as research advances.

Recent transformer-based designs such as BERT and optimized derivatives such as RoBERTa [1] are commonly used to build context-aware embeddings. These models struggle with high-dimensional optimization and semantic drift, which makes STS fine-tuning an active research field [2]. Researchers have improved sentence embeddings by using dictionary-based methods [3], optimizing contrastive learning [4], and introducing novel augmentation techniques [5]. These advances have improved STS models' capacity to discern nuanced textual links while lowering dependence on large, labeled datasets. However, enhancing generalizability across varied datasets and computational efficiency in large-scale deployments remain open issues. Talaat [6] showed how hybrid BERT models can improve classification in sentiment analysis and text similarity tasks. STS performance is also affected by the robustness of similarity metrics. Traditional cosine similarity metrics are widely used, but saturation limits their efficacy in high-dimensional domains. Alternatives such as angle-optimized embeddings [7] and graph-based similarity metrics [8] address these issues. Despite these advances, researchers are still developing new optimization methods to increase STS model performance.

Cross-lingual STS research is expanding as multilingual NLP applications become more popular. Although models such as mBERT and XLM-R can handle several languages, aligning meaning across languages remains difficult. Recent research demonstrates that hybrid training methods using monolingual and multilingual datasets can increase cross-lingual transferability and STS model robustness in real-world contexts. Text similarity predictions can also be improved by including reinforcement learning in STS models. Reinforcement learning lets models dynamically adjust similarity scores based on feedback, enhancing search ranking. Using reward-based optimization techniques, STS models improve accuracy and adaptability over static transformer-based embeddings.

This research presents a novel RoBERTa-CHSCSO model. We seek to solve the optimization problems of fine-tuning large-scale STS models by integrating RoBERTa's contextual knowledge with CHSCSO's increased search capabilities. CHSCSO introduces chaotic maps to improve convergence and exploration, allowing for a more efficient and accurate determination of semantic similarity. This hybrid model is expected to yield improvements in both computational efficiency and the quality of similarity predictions. The paper's contributions are summarized as follows:
Employing chaotic maps to improve the convergence and exploration of RoBERTa’s fine-tuning process, addressing challenges related to high-dimensional optimization and semantic drift.
Reducing the computational burden of traditional transformer-based fine-tuning by optimizing parameter selection and search space exploration using CHSCSO.
Introducing a novel hybrid model that leverages RoBERTa’s contextual understanding with CHSCSO’s enhanced search capabilities for optimizing text similarity assessment.
The remainder of the paper is structured as follows: Sect. "RELATED WORKS" reviews related works in STS and optimization techniques, Sect. "METHODS AND MATERIALS" presents methods and materials, Sect. "Result and Experimental" outlines the experimental setup, Sect. "Discussion" discusses the results, and Sect. "Conclusion" concludes the study with future research directions.
Related Works
In this section, we explore several state-of-the-art techniques that contribute to the advancement of STS. Ponwitayarat et al. [10] introduced a novel embedding space decomposition method called MixSP, which separates sentence pairs into upper-range and lower-range categories based on similarity scores. On numerous STS datasets, their strategy reduced category overlaps and improved similarity ranking with an average Spearman's rank correlation score of 85.49%. According to the authors, MixSP distinguished very similar and dissimilar sentence pairings better than earlier techniques that regarded similarity as a single spectrum.

Kachwala et al. [11] established REMATCH, a robust and efficient metric for comparing Abstract Meaning Representations (AMRs). The study showed that AMR similarity metrics such as Smatch, S2match, and Sembleu are computationally inefficient and inaccurate at capturing semantic content. REMATCH enhances structural and semantic similarity using motif-based graph similarity. Five times faster than Smatch, REMATCH surpassed other metrics in semantic consistency with 67.72% accuracy on the STS-R test. REMATCH handled huge AMRs well and proved helpful in natural language processing applications such as question answering and summarization.

Shu and Lampos [12] introduced Unsupervised Hard Negative Augmentation (UNA), which uses the TF-IDF retrieval technique to generate synthetic negative samples for contrastive learning. They generated challenging negative cases based on sentence phrase relevance to improve sentence similarity tasks. UNA outperformed standard augmentations on the STS-B benchmark with a Spearman correlation of 0.7614 with BERT. Their method shows that UNA improves contrastive learning and that hard negatives can improve sentence representation learning.

DEF2VEC, a dictionary-based word embedding approach, was developed by Morazzoni et al. [13]. They used structured lexical information from WIKTIONARY definitions to improve the efficiency and extensibility of static word embeddings. The authors showed that DEF2VEC's LSA-derived embeddings beat WORD2VEC, GLOVE, and FASTTEXT in POS tagging, NER, and chunking. DEF2VEC had competitive POS and NER accuracies of 72.42% and 71.98%, respectively. Their findings show that dictionary-based embeddings can improve semantic representation, notably for out-of-vocabulary words, without model retraining.

Wu et al. [4] established Adversarial Self-Attention (ASA) to reduce spurious features in Transformer-based language models. The study added adversarial bias to the self-attention mechanism to improve the generalization and robustness of BERT, RoBERTa, and DeBERTa. The researchers evaluated ASA on sentiment analysis (SST-2), natural language inference (MNLI, QNLI), semantic similarity (QQP, STS-B), entity recognition (WNUT-17), and machine reading comprehension. The proposed strategy improved accuracy to 96.3% on SST-2 and 88.0% on MNLI. ASA outperformed standard FreeLB and SMART in generalization and efficiency. The paper suggests optimizing ASA's computational efficiency and applying it to different architectures to improve model robustness.

Chuang et al. [3] presented DiffCSE, a difference-based contrastive learning framework for sentence embeddings. Traditional contrastive models encourage invariance to all augmentations, but DiffCSE makes sentence embeddings sensitive to significant syntax changes. Equivariant contrastive learning was achieved by combining dropout-based augmentation (SimCSE) with masked language model (MLM)-based word replacement.
DiffCSE outperformed SimCSE by 2.3 absolute points on seven semantic textual similarity (STS) tasks and seven transfer learning tasks from the SentEval benchmark. DiffCSE-RoBERTa, the top model, improved Spearman's correlation from 76.25% (SimCSE) to 78.49% (DiffCSE) on the STS-Benchmark and achieved 87.04% SentEval classification accuracy. These findings show that equivariant contrastive learning improves sentence representation and downstream NLP performance.
Deshpande et al. [5] introduced Conditional Semantic Textual Similarity (C-STS), a new method for resolving STS ambiguity. Instead of a single notion of semantic similarity, the study suggested analyzing sentence similarity conditioned on natural language features. The researchers created the C-STS-2023 dataset from 18,908 annotated sentence pairings drawn from MS COCO (83,000 images) and Flickr30K (31,000 images). The dataset was curated using Mechanical Turk image-caption retrieval, filtering, and annotation. Modern NLP models, including SimCSE, RoBERTa, GPT-4, and CLIP-ViT, were tested using bi-, cross-, and tri-encoder architectures. SimCSE-Large in a bi-encoder setup performed best, with a Spearman correlation of 47.5, showing room for improvement in sentence representation learning. The paper recommends improving model designs and training methodologies for fine-grained semantic similarity tasks.

To overcome high STS annotation disagreement, Wang et al. [14] created the USTS dataset. The study examined how current STS models misrepresent human judgment variability and presented a dataset of around 15,000 Chinese sentence pairs and 150,000 labels. Several machine learning methods were evaluated, including BERT-based models, Sentence-BERT, GMM, and Bayesian uncertainty estimation. According to the study, traditional STS models represent predicted confidence over the dataset rather than human annotation variance. The best model, BERT-lr, had a 0.86 Pearson correlation and a 0.69 Spearman's rank correlation. The study stressed the relevance of modeling uncertainty in STS and suggested multilingual extension and use in other subjective NLP tasks.

Al Sulaiman et al. [15] used transfer learning to explore STS for Modern Standard Arabic (MSA) and Arabic dialects. Three methods were proposed to overcome the lack of high-quality Arabic STS datasets and the difficulty of processing dialectal Arabic. The first method fine-tuned Arabic STS models by automatically translating English STS datasets into Arabic. The second method improved performance by interleaving English STS data with Arabic BERT models. The third employed knowledge distillation-based models fine-tuned with a manually translated dataset of 1.3K sentence pairs. Various deep learning techniques, including BERT, SBERT, ARBERT, paraphrase-multilingual-mpnet-base-v2, and distiluse-base-multilingual-cased-v2, were evaluated. The study achieved a correlation of 81% for MSA, 77.5% for Egyptian Arabic, and 76% for Saudi Arabic using the STS 2017 Arabic evaluation set. The findings highlight the effectiveness of transfer learning for low-resource languages and suggest future expansion to additional Arabic dialects and application of the models to downstream NLP tasks.

Table 1 summarizes recent studies of STS, highlighting their methodologies, datasets, and reported accuracy values. To contextualize our contribution within the broader field, we present a comparative analysis highlighting the distinct features of our proposed CHSCSO-RoBERTa model in relation to recent state-of-the-art methods, including MixSP [10], REMATCH [11], UNA [12], and DEF2VEC [13].
Unlike MixSP, which partitions the embedding space into predefined similarity ranges to improve ranking, our approach leverages chaos-enhanced optimization to adaptively refine semantic representations. This allows for greater flexibility across a continuous range of similarity levels without the need for explicit space decomposition. In comparison to REMATCH, which enhances Abstract Meaning Representation (AMR) graph matching using motif-based structures, our model directly optimizes raw contextual embeddings from RoBERTa. This design enables broader applicability and improved scalability by avoiding the computational overhead associated with graph parsing. The UNA model focuses on generating hard negative samples to improve contrastive learning.
Table 1. A summary of recent studies in STS
Researchers | Methodology | Techniques | Dataset | Reported Results |
|---|---|---|---|---|
Ponwitayarat W, et al. [10]/(2024) | Introduced MixSP (Mixture of Specialized Projectors), a method to decompose the embedding space into two distinct parts | MixSP (Mixture of Specialized Projectors) | STS-B (Semantic Textual Similarity Benchmark) | Accuracy: CDSC-R(Test): 86.95, 90.42 (Avg.) |
Kachwala Z, et al. [11]/(2024) | Rematch algorithm for AMR similarity | REMATCH (Robust Efficient Matching) | structural (RARE) and semantic (STS-B, SICK-R) AMR 3.0 | RARE: 95.01 STS-B: 73.95 SICK-R: 71.01 |
Shu Y, Lampos V. [12]/(2024) | Introduced Unsupervised Hard Negative Augmentation (UNA) using TF-IDF for better negative sample generation | UNA (Unsupervised Hard Negative Augmentation) | STS Benchmark, SICK Relatedness | Spearman’s correlation on STS-B: 0.7614, SICK Relatedness: 0.7820 SentEval—STS tasks: RoBERTa: STS15(.8334) |
Morazzoni I, et al. [13]/(2023) | DEF2VEC model that constructs word embeddings using Latent Semantic Analysis (LSA) on dictionary definitions | DEF2VEC (Dictionary-based Embedding with LSA) | WIKTIONARY (English version) | For DEF2VEC Accuracy on POS task achieves accuracy: 72.42% (test), NER task: 71.98% (test), CHUNK task: 77.69% |
Li X, Li J. [7]/(2023) | Introduced AnglE (Angle-Optimized Embeddings) to optimize angle differences in complex space | AnglE (Angle-Optimized Embeddings) | STS Benchmark, GitHub Issues Similarity | AnglE shows an average Spearman correlation of AnglE-BERT: 73.55% (Non-transfer STS tasks) AnglE-LLaMA: 95.28(SST2), 91.38(Avg) |
Wu et al. [4]/(2023) | Uses ASA to bias self-attention, reducing reliance on spurious tokens | ASA, BERT, RoBERTa, DeBERTa | SST-2, MNLI, QNLI, QQP, STS-B, WNUT-17, DREAM, ANLI, | RoBERTa-ASA achieved the highest accuracy: 96.3% on SST-2 |
Yung-Sung Chuang et al. [3]/(2022) | Uses equivariant contrastive learning by combining SimCSE-style dropout augmentation with a difference prediction loss | DiffCSE, SimCSE, ELECTRA-based discriminator | STS 2012–2016, STS-Benchmark, SICK-Relatedness, SentEval tasks | DiffCSE-RoBERTa achieved highest accuracy: 87.04% on SentEval tasks |
Ameet Deshpande et al. [5]/(2023) | Uses free-form conditions to determine sentence similarity, annotating data through crowdsourcing | SimCSE, RoBERTa, GPT-4, CLIP-ViT | C-STS-2023 | Best model: SimCSE-Large Bi-encoder (47.5 Spearman) |
Yuxia Wang et al. [14]/2023 | Introduced USTS dataset with ∼15,000 Chinese sentence pairs and 150,000 labels, analyzed annotation disagreements, and assessed model reliability in capturing human variance | BERT-based models, Sentence-BERT, Gaussian Mixture Model (GMM), Bayesian uncertainty estimation | USTS (Uncertainty-aware STS) | For USTS-U: BERT-lr-MC model Pearson correlation (r): 0.861 Spearman rank correlation (ρ):0.697 |
Al Sulaiman et al. [15]/2022 | Three approaches: (1) Machine translation of English STS data to Arabic, (2) Interleaving English STS data with Arabic BERT models, (3) Fine-tuning knowledge distillation-based models with translated datasets | BERT, SBERT, ARBERT, paraphrase-multilingual-mpnet-base-v2, distiluse-base-multilingual-cased-v2, paraphrase-xlm-r-multilingual-v1 | STS 2017 Arabic set | paraphrase-multilingual-mpnet-base-v2 model achieved 81% correlation for MSA, 77.5% for the Egyptian dialect, and 76% for the Saudi dialect |
Li X, Li J. [16]/2024 | AoE embedding method | AoE | STS datasets, namely: STS, SICK-R, and STS-B | AoE-LLaMA13B: TREC 96.60 ± 0.60 |
In contrast, our approach complements such sampling strategies by employing CHSCSO to optimize similarity thresholds and enhance embedding discriminability, rather than relying solely on sample generation. DEF2VEC constructs dictionary-based static embeddings, which lack contextual flexibility. Our model addresses this limitation by utilizing contextualized embeddings from RoBERTa, further refined through CHSCSO to better capture dynamic semantic nuances and mitigate issues related to polysemy. Furthermore, recent advancements in semantic similarity and textual entailment have explored a variety of sophisticated strategies, including multi-strategy ensembles with external knowledge integration, contrastive structural learning, chaos-enhanced fuzzy similarity models, hybrid metaheuristics, and transformer–graph neural network fusion techniques [17–21]. The proposed work aligns with and extends these innovations by introducing a scalable and theoretically grounded optimization framework that enhances contextual embedding effectiveness with minimal computational overhead.
Methods and materials
We propose an intelligent text similarity assessment framework that leverages the RoBERTa model, as shown in Fig. 1, enhanced with an Integrated CHSCSO technique to improve similarity evaluation performance. Traditional deep learning-based text similarity models often face challenges related to redundancy, suboptimal parameter selection, and sensitivity to noise. The proposed method follows a structured pipeline of data preprocessing, RoBERTa-based feature extraction, chaotic perturbation optimization, and similarity computation. Initially, textual data undergoes tokenization and transformation using RoBERTa’s pre-trained language representations. Subsequently, CHSCSO introduces controlled perturbations to enhance feature representation, mitigate overfitting, and refine similarity scores. The following subsections present CHSCSO, an overview of the SCSO technique, integrating chaos theory with SCSO, BERT, and RoBERTa.
[See PDF for image]
Fig. 1
The main proposed work architecture
Chaotic sand cat swarm optimization (CHSCSO)
CHSCSO [9] is an innovative hybrid meta-heuristic technique for complicated and constrained optimization problems [23, 24]. This technique integrates chaos theory [24, 25] with the characteristics of the recently released SCSO [26]. The main goal is to enhance global search effectiveness and convergence rate by incorporating the non-recurring character of chaos into SCSO's core search operation. Because a chaotic map shares the randomness features of a uniform generator while exhibiting superior statistical and dynamic characteristics, it can replace the random components in SCSO. Without such a mechanism, the search can suffer from low population diversity, inefficient exploration, local-optimum traps, and low search consistency. Various chaotic maps are utilized in CHSCSO to achieve enhanced outcomes during the exploration and exploitation stages. Experiments were performed on a wide range of popular evaluation functions and real-life problems to strengthen the credibility of the results: CHSCSO was employed on thirty-nine functions and challenges from multiple disciplines, and it produced results 76.3% better than other chaos-based meta-heuristics and a best-developed SCSO variant. This comprehensive examination shows that the CHSCSO technique performs well in producing acceptable outcomes, so the proposed work utilizes CHSCSO as the optimizer for RoBERTa [1] rather than an arbitrary alternative to improve text similarity. A comprehensive explanation of this technique is provided in the subsequent subsections, beginning with a summary of the SCSO technique.

The rationale for selecting Chaos-Enhanced Sand Cat Swarm Optimization (CHSCSO) in the context of text similarity optimization is as follows. Sand Cat Swarm Optimization (SCSO) is inspired by the adaptive and directional hunting behavior of sand cats, using an exploration and exploitation balance defined by a sensitivity range (R(t)) which shrinks over time to enable convergence. The SCSO position update is expressed as:
$$\vec{Pos}(t+1) = \vec{r} \cdot \left(\vec{Pos}_{bc}(t) - rand(0,1) \cdot \vec{Pos}_c(t)\right) \tag{1}$$

where $\vec{r}$ controls the search range and gradually reduces over iterations, encouraging exploitation in later stages. Chaos-enhanced optimization replaces uniformly random numbers in the update rules with values derived from chaotic maps, introducing ergodicity and pseudo-randomness into the search. This ensures better diversity and prevents early stagnation. The logistic map is a representative example:

$$x_{k+1} = \mu \, x_k (1 - x_k) \tag{2}$$

where $\mu = 4$ ensures chaotic behavior and $x_{k+1}$ replaces the random coefficient in position updates. The text similarity task, when fine-tuned with RoBERTa, inherently generates high-dimensional semantic embeddings. Optimizing thresholds or parameters (e.g., similarity margins, classification cut-offs) within such high-dimensional spaces often leads to local-minima traps. Standard metaheuristics such as Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) may face premature convergence or inadequate exploration in such complex landscapes.

Overview of SCSO technique
The four primary types of meta-heuristic techniques are swarm intelligence (SI), human behavior, physics-based, and evolutionary techniques [27]. SI techniques are currently attracting a lot of interest from researchers. The collective behavior of a decentralized or self-organizing system is another definition of SI [28]. Many individuals with low intelligence engage with one another according to the basic principles that make up this technique, and several research efforts have been conducted in this field [28–31]. A novel SI-based meta-heuristic technique is the SCSO [26]. This technique can react fast and carry out exploration–exploitation operations appropriately because of the distinctive hearing and hunting skills of desert sand cats. These cats have remarkable hearing abilities, and they hunt primarily at night. They satisfy at least 10% more of their food requirements than typical cats and can travel great distances without stopping, which raises the possibility of improved responses in subsequent rounds. These unique characteristics allow them to track the position and movement of their prey more accurately. Sand cats (SCs) go through two general phases while foraging, according to their behavioral traits: finding prey and attacking the prey. This makes it possible for the technique to be used effectively in both the exploration and exploitation phases. In this population-based technique, the search agent (sand cat) represents the unknown parameters of the problem. Every search agent, or cat, is represented as a vector whose length reflects the dimensionality of the problem. The fitness function of each problem serves as the basis for evaluating how well the technique performed (Eq. 3).
$$\vec{X}_{best} = \arg\min_{i \in \{1, \dots, n\}} f(\vec{X}_i) \tag{3}$$
The following equations (Eqs. 4–7) list the mathematical representations that work well in the exploration and exploitation stages of the SCSO. This guarantees that every SC arrives at the goal, which is the solution to the issue, by moving toward the prey.

$$r_G = s_M - \left(\frac{s_M \times t}{T}\right) \tag{4}$$

$$R = 2 \times r_G \times rand(0,1) - r_G \tag{5}$$

$$\vec{r} = r_G \times rand(0,1) \tag{6}$$

$r_G$ is a parameter that, as iterations go on, drops linearly from 2 to 0. In both stages, two crucial coefficients are $R$ and $\vec{r}$. The technique's behavior is balanced between the two stages according to $R$, while $\vec{r}$ is motivated by the sensitivity of the SC's hearing. Both $R$ and $\vec{r}$ are impacted by the parameter $r_G$. Inspired by the ordinary SCSO [26], the given value of the constant $s_M$ is 2; however, assigning other values is feasible. Because of this flexibility, $s_M$ can take a variety of integer values depending on the requirements of various challenges. The maximum number of epochs is denoted by $T$, while the current epoch is represented by $t$. Equation 7 depicts the computational model of the SCSO technique for position updating during both the exploration and exploitation stages:

$$\vec{Pos}(t+1) = \begin{cases} \vec{Pos}_b(t) - \vec{r} \cdot \vec{Pos}_{rnd} \cdot \cos(\theta), & |R| \le 1 \\ \vec{r} \cdot \left(\vec{Pos}_{bc}(t) - rand(0,1) \cdot \vec{Pos}_c(t)\right), & |R| > 1 \end{cases} \tag{7}$$

where $\vec{Pos}_c(t)$ is the present location of every SC, $\vec{Pos}_{bc}(t)$ is the best candidate location, $\vec{Pos}_{rnd}$ is a randomized location, and $\vec{Pos}_b(t)$ is the location of the best SC globally. Using the angle $\theta$ in a cosine helps the SC get closer to the prey. The SCs are instructed to attack when the criterion $|R| \le 1$ is satisfied; if not, the SCs are entrusted with locating a novel potential solution in the global region. The Roulette Wheel is used to choose the value of $\theta$. Algorithm 1 displays the SCSO technique.
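As an illustration of Eqs. 4–7, a minimal Python sketch of one SCSO iteration follows. Variable names and the uniform sampling of $\theta$ (in place of the Roulette Wheel) are our simplifications, not taken from [26], and the global-best vector stands in for the best-candidate location.

```python
import numpy as np

def scso_step(pos, pos_best, t, T, s_M=2.0):
    """One SCSO position update (Eqs. 4-7) over a population `pos`
    of shape (n_agents, n_dims); `pos_best` stands in for both the
    best-candidate and global-best locations for simplicity."""
    n, d = pos.shape
    r_G = s_M - (s_M * t / T)                      # Eq. 4: decays linearly 2 -> 0
    new_pos = np.empty_like(pos)
    for i in range(n):
        R = 2.0 * r_G * np.random.rand() - r_G     # Eq. 5: phase-control coefficient
        r = r_G * np.random.rand()                 # Eq. 6: hearing-sensitivity range
        if abs(R) <= 1:                            # exploitation: attack the prey
            theta = np.random.uniform(0.0, 2.0 * np.pi)  # roulette wheel simplified
            pos_rnd = np.abs(np.random.rand(d) * pos_best - pos[i])
            new_pos[i] = pos_best - r * pos_rnd * np.cos(theta)
        else:                                      # exploration: search for prey
            new_pos[i] = r * (pos_best - np.random.rand() * pos[i])
    return new_pos
```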
Integrating chaos theory with SCSO
Since actual engineering or design issues are challenging and complicated, meta-heuristic [23] techniques are often suggested to minimize implementation time and calculation costs. They can occasionally depart from the ideal solution due to issues like early convergence, poor search consistency, local optimum traps, inefficient searches, and inadequate population diversity [32].
The SCSO technique shares some of these issues. We use a novel hybrid method that integrates chaotic maps with the characteristics of the recently released SCSO to get around these problems. The dynamic nature of these maps can accelerate the search operation and provide benefits such as escaping local regions. Although SCSO has a fair and balanced convergence rate, it might not always be effective at identifying the global optimum, which in turn affects the convergence rate. Furthermore, because of the random working mechanism, the search agents can behave blindly, exploiting the search space at only a restricted rate. To address this shortcoming and boost efficiency, the CHSCSO technique incorporates the chaotic notion. Chaos is a random-seeming yet deterministic process in a nonlinear, dynamic, non-converging, period-less, and finite environment, and it is sensitive to its starting values. The term "chaotic" refers to the highly unpredictable behavior of a complex system. Chaotic systems can be viewed as sources of randomness since, in mathematics, chaos characterizes the randomness of a basic deterministic dynamical process. In this sense, chaos is introduced into the SCSO using a variety of chaotic maps with distinct mathematical equations. The chaotic mapping distributes the population uniformly. Using a function, chaotic maps aim to reproduce the behavior of chaos in the optimization process based on a parameter. Furthermore, these maps enable a more dynamic and global scanning of the search field, which benefits the SCSO by enabling dynamic behaviors. These maps demonstrate dynamic and complex behavior in nonlinear environments [33, 34].

The CHSCSO technique incorporates chaos using several well-known chaotic maps with distinct mathematical equations, listed in Table 2. Twelve chaotic maps are used in the CHSCSO technique to adjust the regular SCSO technique's step size. This seeks to produce more stable and well-rounded solutions while raising the likelihood of population spread.
Table 2. Common chaotic maps utilized in the CHSCSO technique
No | Name | Range |
|---|---|---|
1 | Quadratic | (0,1) | |
2 | Bernoulli | (0,1) | |
3 | Tent | (0,1) | |
4 | Logistic | (0,1) | |
5 | Gauss/Mouse | (0,1) | |
6 | Singer | (0,1) | |
7 | Sine | (0,1) | |
8 | Sinusoidal | (0,1) | |
9 | Piecewise | (0,1) | |
10 | Iterative | (-1,1) | |
11 | Circle | (0,1) | |
12 | Chebyshev | (-1,1) |
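To illustrate how the maps in Table 2 generate deterministic, non-repeating sequences, the following minimal sketch implements three of them with the initial value 0.7 used later in this section. The parameterizations follow standard textbook forms and are not necessarily identical to those in [9].

```python
import math

def logistic(x, mu=4.0):          # No. 4: fully chaotic for mu = 4
    return mu * x * (1.0 - x)

def tent(x, a=0.7):               # No. 3: piecewise-linear tent map
    return x / a if x < a else (1.0 - x) / (1.0 - a)

def sine(x):                      # No. 7: sine map on (0, 1)
    return math.sin(math.pi * x)

def chaotic_sequence(map_fn, x0=0.7, n=5):
    """Generate n chaotic values: deterministic, yet non-periodic."""
    seq, x = [], x0
    for _ in range(n):
        x = map_fn(x)
        seq.append(x)
    return seq

print(chaotic_sequence(logistic))  # e.g. [0.84, 0.5376, 0.99435..., ...]
```

Such a sequence substitutes for the uniform random draws in the SCSO update, which is what gives CHSCSO its ergodic, stagnation-resistant search behavior.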
The rationale for selecting CHSCSO over other metaheuristics stems from its ability to improve convergence behavior and search diversity using chaotic maps. Unlike conventional techniques such as PSO or DE, which rely on fixed or linearly decaying parameters, CHSCSO dynamically perturbs the solution space through deterministic but non-repeating sequences. These chaotic maps guide agents toward global optima while reducing stagnation in local minima. This behavior is particularly suited for semantic textual similarity tasks, where the solution space is non-convex and high-dimensional, and gradient signals are noisy or sparse. By combining chaotic exploration with swarm-based exploitation, CHSCSO maintains balance across optimization phases. This is critical for fine-tuning representation-sensitive models like RoBERTa, where overfitting and semantic drift are prevalent challenges [9, 23].
Once the SCSO's core reproduction operation is complete, the newly developed location from the SCSO search process is updated using the chaotic map, according to Table 2. For specific chaotic maps, the initial value might significantly impact the fluctuation behavior; the starting value for all maps in this integration was set to 0.7 [30, 35]. In this situation, the chaotic map can also help initialize the SC population to improve the initial solution and raise the convergence precision. CHSCSO uses a hybrid, multi-strategy methodology: the technique presents two approaches to location updating, one being the conventional position-updating method and the other based on the chaotic model [30, 36]. According to Eq. 8, each strategy is selected with equal probability.
$$\vec{Pos}(t+1) = \begin{cases} \vec{Pos}_{SCSO}(t+1), & p < 0.5 \\ \vec{Pos}_{chaotic}(t+1), & p \ge 0.5 \end{cases} \tag{8}$$

where $p$ is a random value in [0, 1]. The exploration and exploitation stages can be enhanced by using Eq. 9, which is used to calculate the parameter $r_G$ rather than Eq. 4. Although $r_G$ is the most vital parameter for fulfilling an identical task and is crucial in demonstrating remarkably equitable and balanced behavior for both exploration and exploitation, it can experience sluggish convergence and unsuccessful exploitation, particularly in complicated and constrained issues. Therefore, the parameters $R$ and $\vec{r}$ are also impacted, according to Eq. 10 and Eq. 11.

$$r_G = s_M \times CV_k \tag{9}$$

where $s_M$ is a constant and $CV_k$ is the chaotic value produced by the selected map.

$$R = 2 \times r_G \times CV_k - r_G \tag{10}$$

$$\vec{r} = r_G \times CV_k \tag{11}$$

$$\vec{Pos}(t+1) = \begin{cases} \vec{Pos}_b(t) - \vec{r} \cdot \vec{Pos}_{rnd} \cdot \cos(\theta), & |R| \le 1 \\ \vec{r} \cdot \left(\vec{Pos}_{bc}(t) - CV_k \cdot \vec{Pos}_c(t)\right), & |R| > 1 \end{cases} \tag{12}$$

Utilizing a chaotic map, $CV_k$ is computed, and Eq. 12 gives the CHSCSO general mathematical model. Thus, the proposed work aims to utilize CHSCSO as an optimizer for RoBERTa rather than an arbitrary alternative to improve text similarity. Figure 2 presents the flowchart and Algorithm 2 summarizes the CHSCSO technique, respectively.
[See PDF for image]
Fig. 2
Flowchart of CHSCSO technique
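To make the multi-strategy update of Eq. 8 concrete, the following minimal sketch shows one CHSCSO iteration in which the chaotic value simply replaces the uniform random coefficients; the function and variable names are ours, and the update is a simplified stand-in for the full algorithm in [9].

```python
import numpy as np

def chscso_step(pos, pos_best, cv, t, T, s_M=2.0):
    """One CHSCSO update (Eq. 8): each agent uses the conventional SCSO
    rule or its chaotic counterpart with equal probability; `cv` is the
    current chaotic value from the selected map in Table 2."""
    r_G = s_M - (s_M * t / T)
    new_pos = np.empty_like(pos)
    for i in range(len(pos)):
        use_chaos = np.random.rand() >= 0.5        # Eq. 8: balanced strategy choice
        rnd = cv if use_chaos else np.random.rand()
        R = 2.0 * r_G * rnd - r_G                  # Eq. 10 (chaotic) / Eq. 5
        r = r_G * rnd                              # Eq. 11 (chaotic) / Eq. 6
        theta = np.random.uniform(0.0, 2.0 * np.pi)
        if abs(R) <= 1:                            # exploitation branch of Eq. 12
            pos_rnd = np.abs(rnd * pos_best - pos[i])
            new_pos[i] = pos_best - r * pos_rnd * np.cos(theta)
        else:                                      # exploration branch of Eq. 12
            new_pos[i] = r * (pos_best - rnd * pos[i])
    return new_pos
```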
Bidirectional encoder representations from transformers (BERT)
In this subsection, the pre-trained BERT [1] is presented. The BERT setup is as follows: a combination of two sections (lists of tokens), $x_1, \dots, x_I$ and $y_1, \dots, y_J$, is passed into BERT. Typically, sections comprise more than one natural-language sentence. With special tokens separating them, the two sections are provided to BERT as one input chain: [CLS], $x_1, \dots, x_I$, [SEP], $y_1, \dots, y_J$, [EOS]. $I$ and $J$ are conditioned such that $I + J < S$, where $S$ is a parameter that regulates the maximum chain length throughout training. A massive unlabeled text corpus is used for pre-training BERT, and labeled end-task data is then used to fine-tune it. BERT utilizes the transformer design [37], which has $M$ layers and is currently widely used. Self-attention heads and a hidden dimension $H$ are used in every block. Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) are utilized by BERT as its pre-training objectives. In MLM, the special token [MASK] is substituted for a randomly chosen subset of the input chain's tokens. The MLM objective is a cross-entropy loss on predicting the masked tokens.
BERT consistently chooses 15% of the input tokens for potential replacement. Of the selected tokens, 80% are swapped out for [MASK], 10% remain the same, and 10% are swapped out for a randomly selected vocabulary token. In the original implementation, random masking and replacement are performed once at the start and saved for the duration of training; in practice, the data is duplicated so that the mask is not always identical for every training sentence. To determine whether two sections in the input text follow one another, NSP is a binary classification loss. Positive examples are produced using consecutive sentences from the text corpus; negative examples are produced by pairing sections from different documents. Negative and positive examples are sampled with equal probability.
The goal of the NSP was to enhance effectiveness on downstream tasks that call for deductive thinking about the connection between phrase pairs, like Natural Language Inference [38].
Adam [39] optimizes BERT with the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6, and an L2 weight decay of 0.01. The learning rate warms up to a maximum value of 1e-4 over the first 10,000 steps and then decays linearly. BERT uses the GELU activation function [40] and trains with a dropout of 0.1 on all layers and attention weights. With mini-batches of B = 256 sequences of maximum length S = 512 tokens, the models are pre-trained for T = 1,000,000 updates.
RoBERTa: a robustly optimized BERT pre-training approach
Although language model pre-training has resulted in notable performance improvements, it can be difficult to carefully compare the various strategies: hyper-parameter selections have an important effect on the outcome, training is computationally costly, and studies frequently involve private datasets of various sizes. RoBERTa, a replication of BERT pre-training, meticulously measures the impact of numerous important hyper-parameters and of training data size. It shows that BERT was undertrained and, when trained properly, can operate on the same level as or better than any model released after it. The RoBERTa model performs at the cutting edge on SQuAD [41], RACE [42], and GLUE [2]. These findings call into question the origin of recently published gains and emphasize the significance of previously disregarded design decisions. The basic features of RoBERTa are as follows: (1) training the model with longer sequences; (2) eliminating the NSP objective; (3) training on larger batches of data; and (4) dynamically altering the masking pattern applied to the training data. BERT uses token prediction and random masking.
A single static mask was produced by the original BERT implementation's single masking operation during data preparation. To prevent the same mask from being used for every sequence in every epoch, the training data was copied ten times so that each sequence is masked in 10 distinct ways over the 40 training epochs; as a result, each training sequence was seen with the identical mask four times during training. RoBERTa instead utilizes dynamic masking [43], where the masking pattern is created every time a sequence is fed to the model. This becomes essential when pre-training for more steps or with bigger datasets. RoBERTa utilizes a larger byte-level Byte-Pair Encoding (BPE) vocabulary [44], dynamic masking, larger mini-batches [45], and FULL-SENTENCES without the NSP loss. BPE combines word-level and character-level encoding to handle the extensive vocabulary found in natural-language corpora. Instead of whole words, BPE uses sub-word units, which are obtained through statistical examination of the training corpus. Typically, BPE vocabulary sizes fall between 10,000 and 100,000 sub-word units. FULL-SENTENCES: every input is made up of consecutive full sentences collected from one or more documents, with a maximum total size of 512 tokens. Inputs may cross document boundaries; when one document ends, an additional delimiter token is added and sampling continues from the subsequent document.
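The static-vs-dynamic masking distinction described above can be sketched as follows; the 80/10/10 corruption split mirrors the description, while the function name and the -100 ignore-index convention are illustrative assumptions.

```python
import random

def dynamic_mask(token_ids, vocab_size, mask_id, p=0.15):
    """Re-sample the mask every time a sequence is fed to the model,
    so different epochs see different corruption patterns (dynamic
    masking); calling this once and caching the result would instead
    reproduce BERT's static masking."""
    masked, targets = list(token_ids), [-100] * len(token_ids)  # -100 = ignore in loss
    for i, tok in enumerate(token_ids):
        if random.random() < p:          # select 15% of tokens
            targets[i] = tok             # the model must predict the original token
            roll = random.random()
            if roll < 0.8:               # 80%: replace with [MASK]
                masked[i] = mask_id
            elif roll < 0.9:             # 10%: replace with a random vocabulary token
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the token unchanged
    return masked, targets
```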
Theoretical motivation and comparative analysis of CHSCSO
CHSCSO is selected due to its integration of chaotic maps into the Sand Cat Swarm Optimization framework, which offers key advantages over other swarm and chaotic optimization variants such as PSO, WOA, DE, and FPA. Unlike PSO and DE, which rely on fixed or linearly decreasing parameters, CHSCSO uses deterministic chaotic sequences to dynamically perturb the search space. This prevents premature convergence and enhances the exploration of the global optimum [48, 49]. The hybrid strategy in CHSCSO combines swarm-based exploitation with chaos-driven exploration. This balance is critical in high-dimensional, non-convex problems like fine-tuning transformer embeddings, where traditional algorithms often stagnate in local optima [50]. The chaos-enhanced search process enables CHSCSO to maintain population diversity, addressing common metaheuristic drawbacks such as early convergence and low consistency [50, 51].
Result and experimental
This section presents the results and experimental analysis of the proposed text similarity model, which combines CHSCSO's strengths with RoBERTa, a transformer-based language model. CHSCSO, a novel metaheuristic algorithm inspired by sand cat hunting, optimizes feature extraction and representation to improve RoBERTa’s semantic similarity detection. In experiments using benchmark datasets, the proposed hybrid model is evaluated for accuracy, resilience, and computational efficiency. Chaotic optimization can be used to fine-tune and refine transformer-based models (RoBERTa) for text similarity tasks, revealing the synergy between bio-inspired algorithms and deep learning architecture.
Datasets
Semantic textual similarity dataset
The STS dataset is used to quantify semantic similarity between paragraphs [46], a task with many applications. The dataset consists of 4022 rows of paired text data, with similarity scored from 0 (highly dissimilar) to 1 (highly similar). The dataset used for textual similarity evaluation is described in Table 3, outlining its format, labels, and evaluation metrics.
Table 3. STS Dataset Description
Aspect | Details |
|---|---|
Context | Semantic Textual Similarity (STS) is a foundational problem in natural language understanding |
Problem statement | Given two paragraphs, quantify the degree of similarity between them based on semantic meaning |
Objective | Predict a similarity score between 0 and 1, where: |
- 1: Highly similar | |
- 0: Highly dissimilar | |
Dataset | - Columns: text1, text2 (both contain text data) |
- Rows: 4022 | |
- Each row contains a pair of paragraphs | |
- Labels: Similarity score between 0 and 1 (not provided in the dataset, needs to be predicted) | |
Output | - A similar score between 0 and 1 for each pair of paragraphs |
Textual entailment dataset
This dataset contains textual entailment data for natural language understanding [47]. It has three files designed for textual entailment tasks: train.csv, validation.csv, and test.csv, each containing the columns used for model training and evaluation: text1, text2, label, and label_text. This data helps researchers build advanced models of natural language relationships.
Table 4 presents an overview of the textual entailment dataset, outlining the structure and classification labels utilized in our research.
Table 4. The Textual Entailment Dataset Description
Column name | Description |
|---|---|
text1 | First text in pair to be assessed for textual entailment |
text2 | Second text in pair to be matched to text1 to determine logical relationship |
label | A categorical field that shows text1 and text2's meaning or logical inference relationship |
label_text | The label's human-readable text helps explain the real-world consequences of text1 and text2 |
Data preprocessing
The preprocessing pipeline for text similarity evaluation comprises several essential processes to guarantee that the input data is clean, consistent, and appropriate for model training. Initially, tokenization is executed with RoBERTa's tokenizer, which transforms raw text into sub-word tokens that align with the model architecture. Upon receiving an input string $s$, the tokenizer produces a sequence of tokens $t_1, t_2, \dots, t_n$, with $n$ representing the total number of tokens. This procedure is crucial for deconstructing text into manageable units that the model can analyze. The text is then subjected to lowercasing, punctuation removal, and stopword filtering. Lowercasing guarantees consistency by transforming all characters to lowercase, whereas punctuation removal eliminates non-alphanumeric characters that may lack semantic significance. Stopword filtering eliminates common words that generally carry little meaning. After cleaning, the token sequences are either padded or truncated to a standardized length of 128 tokens to maintain consistent input dimensions for the model. Padding adds special tokens to shorter sequences, while truncation removes excess tokens from longer sequences. This step ensures that all input sequences have the same length, which is crucial for batch processing. Table 5 illustrates the removed stopwords and punctuation and the resulting tokenized sequences, which represent numerical encodings of the cleaned text for further natural language processing tasks. The token length distribution for Text1 and Text2 is shown in Fig. 3, illustrating the variation in input sequence lengths across the dataset.
Table 5. Comparison of Preprocessed Text Data, Cleaned Text, and Tokenized Representations
text1 | text1_cleaned | text1_tokens | text2 | text2_cleaned | text2_tokens |
|---|---|---|---|---|---|
Savvy searchers fail to spot ads. Internet sea… | savvy searchers fail spot ads internet search… | [0, 27,816, 11,454, 38,850, 7873, 5998, 1514, 581…] | Newcastle 2–1 Bolton: Kieron Dyer clinched the… | newcastle 21 bolton kieron dyer clinched win e… | [0, 4651, 24,773, 733, 24,724, 1054, 449, 906, 2…] |
Newcastle 2–1 Bolton: Kieron Dyer clinched the… | newcastle 21 bolton kieron dyer clinched win e… | [0, 4651, 24,773, 733, 24,724, 1054, 449, 906, 2…] | I enjoy working with text data and NLP | Enjoy working text data nlp | [0, 225, 20,768, 447, 2788, 414, 295, 39,031, 2,…] |
I love natural language processing! | love natural language processing | [0, 17,693, 1632, 2777, 5774, 2, 1, 1, 1, 1, 1,…] | Measuring text similarity is crucial for many … | measuring text similarity crucial many applica… | [0, 1794, 40,786, 2788, 37,015, 4096, 171, 2975,…] |
Text similarity is an important task in NLP | text similarity important task nlp | [0, 29,015, 37,015, 505, 3685, 295, 39,031, 2, 1,…] | Savvy searchers fail to spot ads. Internet sea… | savvy searchers fail spot ads internet search… | [0, 27,816, 11,454, 38,850, 7873, 5998, 1514, 581…] |
[See PDF for image]
Fig. 3
The token length distribution for Text1 and Text2
To preprocess text for similarity assessment, multiple transformations are applied. Table 5 compares the raw text, cleaned text, and tokenized representations. This table displays pairs of original text samples (text1 and text2).
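A minimal sketch of this preprocessing pipeline follows, assuming a tiny stand-in stopword list (a full resource such as NLTK's would be used in practice); the cleaning step reproduces the style of the cleaned examples in Table 5.

```python
import re
from transformers import RobertaTokenizer

STOPWORDS = {"a", "an", "the", "is", "to", "for", "and", "in", "of"}  # stand-in list

def clean(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # remove punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
enc = tokenizer(clean("Text similarity is an important task in NLP"),
                padding="max_length", truncation=True, max_length=128,
                return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([1, 128]) -- fixed 128-token inputs
```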
Model architecture
The model architecture for the Intelligent Text Similarity Assessment System is designed to leverage the power of RoBERTa for text embedding, combined with a custom CHSCSO technique to optimize the embeddings for similarity tasks. This subsection details the architecture, layers, optimizers, and hyperparameters used in the system; the proposed model architecture is presented in Fig. 1 and in the following subsections.
RobertaTokenizer
The RobertaTokenizer is a pre-processing tool used to convert input text into a format compatible with the RoBERTa model. It breaks the input text into sub-word tokens that can be mapped to the model's vocabulary. This tokenization process involves splitting words into smaller units (sub-words) based on the model's pre-trained vocabulary, which helps in handling out-of-vocabulary words and reducing the complexity of the data input.
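For instance, BPE splits rare or out-of-vocabulary words into known sub-word units; a brief illustration follows (the exact splits depend on the pretrained vocabulary).

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("chaotic metaheuristics"))
# e.g. ['cha', 'otic', 'Ġmet', 'ahe', 'urist', 'ics'] -- sub-word units,
# where 'Ġ' marks a token preceded by whitespace
```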
Text embedding with RoBERTa
The model is based on the RoBERTa architecture, which is utilized to produce contextualized embeddings for pairs of input text. RoBERTa is a transformer-based model that effectively captures semantic relationships in a text. Table 6 provides a summary of the tokenized inputs, hidden states, and mean embedding derived from RoBERTa.
Table 6. Details of Tokenized Input, Hidden States, and Mean Embedding
Key | Value |
|---|---|
Input_ids | tensor([[ 0, 4651, 24,773, 132, 12, 134, 24,724, 1054, 449, 906, 261, 385, 6426, 13,263, 184, 5, 1924, 7, 253, 24,724, 1054, 579, 158, 12, 2670, 9797, 422, 4, 1437, 2084, 242, 7323, 6426, 342, 92, 24,773, 789, 77, 37, 9789, 1149, 2457, 512, 338, 15, 5, 235, 22,480, 1437, 172, 12,631, 196, 88, 5, 443, 7, 476, 184, 10, 12,734, 31, 5, 41,474, 2116, 4, 1437, 2]]) |
Attention_mask | tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]) |
Hidden states | torch.Size([1, 67, 768]) |
Mean embedding | torch.Size([1, 768]) |
RobertaModel
The RobertaModel contains numerous layers of bidirectional transformers that record contextual relationships between words in the input text. RoBERTa produces a hidden-state vector for each token in the input sequence, encapsulating the contextualized representation of the token. The hidden states are represented as $h_1, h_2, \dots, h_n$, where $n$ signifies the length of the input sequence. These vectors represent the contextualized meaning of each token, considering the adjacent tokens inside the sentence.

The embedding extraction method consolidates these hidden states to derive a significant representation of the entire input text. One common method is mean pooling, where the hidden states of all tokens in the sequence are averaged to obtain a single vector representing the entire input. Mathematically, this can be expressed as follows:

$$e = \frac{1}{n} \sum_{i=1}^{n} h_i \tag{13}$$

where $e$ is the final embedding vector, $h_i$ is the hidden state of the $i$-th token in the sequence, and $n$ is the number of tokens in the input sequence. This results in a fixed-size vector of size $d$, where $d$ is the dimensionality of the hidden states (768 for RoBERTa-base). The output embedding $e$ captures the semantic information of the input text and can be used for downstream tasks such as text classification [52–54], question answering, or any other natural language processing application. Figure 4 visualizes the embeddings generated by RoBERTa: (a) a heatmap of token embeddings (hidden states) across the 768 hidden dimensions, where the color scale indicates embedding values ranging from −10 to 10 and brighter areas highlight higher values; and (b) a line plot of the mean embedding values averaged across all tokens for each hidden dimension, illustrating the distribution and variance of embedding magnitudes across dimensions.

[See PDF for image]
Fig. 4
a Heatmap of token embedding. b Mean embedding values across hidden dimensions
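A minimal PyTorch sketch of the mean-pooling step in Eq. 13 follows; it weights by the attention mask so padding tokens do not dilute the average (for an unpadded sequence this coincides with the plain mean in Eq. 13).

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

enc = tokenizer("Text similarity is an important task in NLP", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state        # shape: (1, n, 768), as in Table 6

mask = enc["attention_mask"].unsqueeze(-1)         # (1, n, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # Eq. 13: e = (1/n) sum h_i
print(embedding.shape)                             # torch.Size([1, 768])
```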
The hyperparameters for RoBERTa and the CHSCSO technique are summarized in Table 7.
Table 7. Hyperparameters for RoBERTa and CHSCSO
Component | Hyperparameter | Value |
|---|---|---|
RoBERTa | Maximum sequence length | 128 |
Hidden size | 768 | |
Number of layers | 12 | |
Learning rate (for fine-tuning) | 2e-5 | |
CHSCSO | Number of agents | 10 |
Number of iterations | 50 | |
Embedding dimension | 768 (same as RoBERTa) | |
Similarity Threshold | Threshold for binary classification | 0.5 |
This paper presents a model that combines RoBERTa with CHSCSO to improve the efficacy of text similarity-based summarization tasks. The RoBERTa model, pre-trained with a hidden size of 768 and consisting of 12 transformer layers, is employed to produce contextual embeddings for input texts. The model handles input sequences with a maximum length of 128 tokens, thereby maintaining essential contextual information. A learning rate of 2e-5 is utilized for fine-tuning to enhance performance while ensuring stability.
The CHSCSO technique is utilized to enhance the similarity threshold for summary extraction. The optimization procedure includes 10 agents navigating the search space across 50 iterations to find the optimal threshold for differentiating salient sentences. The embedding dimension of 768, corresponding to RoBERTa's hidden size, guarantees uniform representation in similarity calculations. A similarity threshold of 0.5 is set to execute binary classification, differentiating statements that significantly contribute to the summary from those that do not. This integrated method utilizes RoBERTa's robust language representations and CHSCSO's effective optimization skills to attain enhanced summarization accuracy. The experimental procedure commences with the initialization of a population of candidate solutions, each representing a configuration of weights for assessing text similarity.
Chaotic maps are utilized to improve the diversity of the initial population and prevent early convergence. Chaotic maps generate unpredictability while maintaining deterministic characteristics, facilitating a more thorough investigation of the solution space during the early stages. The efficacy of each candidate solution is evaluated using a fitness function based on cosine similarity. Specifically, RoBERTa embeddings are produced for the input text pairs, and the cosine similarity between these embeddings serves as a metric to evaluate how well each weight set measures text similarity. Better cosine similarity scores reflect better alignment with the target similarity measurement. The CHSCSO optimization method balances the exploration and exploitation phases. To avoid local optima and ensure unpredictability, the technique searches the solution space globally using chaotic maps during exploration. The exploitation phase then refines promising solutions, using adaptive parameters and local search to polish candidate weights. To maximize the fitness function, CHSCSO iteratively updates candidate solutions over multiple generations, using chaotic dynamics and adaptive rules to alter the weights in each iteration and improve similarity-measurement accuracy. Once optimization converges, the RoBERTa embeddings use the CHSCSO-identified best weights: the optimized weights reweight the embedding dimensions so that the text similarity score is calculated more precisely. This weighted similarity measurement reflects the nuanced relationships captured by RoBERTa while enhancing performance through optimization. Table 8 provides the architecture of the RoBERTa model, listing its key layers and their functionalities.
Table 8. Architecture of the RoBERTa Model with its Layers and Components
Layer | Sub-layer | Details |
|---|---|---|
RobertaModel | Embeddings | Embedding layers that transform input tokens into dense vectors |
- Word_embeddings | Embedding (50,265, 768, padding_idx = 1) | |
- Position_embeddings | Embedding (514, 768, padding_idx = 1) | |
- Token_type_embeddings | Embedding (1, 768) | |
- LayerNorm | LayerNorm((768,), eps = 1e-05, elementwise_affine = True) | |
- Dropout | Dropout (p = 0.1, inplace = False) | |
RobertaEncoder | Layer | It contains multiple layers (12 in this case). Each layer has attention, intermediate, and output parts |
- RobertaLayer (0–11) | Repeated 12 times, each with attention, intermediate, and output components | |
RobertaLayer | Attention | Self-attention mechanism |
- RobertaSdpaSelfAttention | Includes query, key, value linear layers with dropout | |
- Query | Linear (in_features = 768, out_features = 768, bias = True) | |
- Key | Linear (in_features = 768, out_features = 768, bias = True) | |
- Value | Linear (in_features = 768, out_features = 768, bias = True) | |
- Dropout | Dropout (p = 0.1, inplace = False) | |
Output | Dense layer followed by LayerNorm and dropout | |
- RobertaSelfOutput | Consists of dense layer, layer normalization, and dropout | |
- Dense | Linear (in_features = 768, out_features = 768, bias = True) | |
- LayerNorm | LayerNorm((768,), eps = 1e-05, elementwise_affine = True) | |
- Dropout | Dropout (p = 0.1, inplace = False) | |
Intermediate | Dense | Linear (in_features = 768, out_features = 3072, bias = True) |
Intermediate_act_fn | GELUActivation() | |
Output | Dense | Linear (in_features = 3072, out_features = 768, bias = True) |
LayerNorm | LayerNorm((768,), eps = 1e-05, elementwise_affine = True) | |
Dropout | Dropout (p = 0.1, inplace = False) | |
RobertaPooler | Dense | Linear (in_features = 768, out_features = 768, bias = True) |
Activation | Tanh () |
The RobertaModel consists of embedding layers that transform input tokens into dense vector representations suitable for downstream processing. Figure 5a illustrates the distribution of cosine similarity scores against the binary labels (0/1), indicating how well similarity scores align with the labeled classification. Figure 5b shows a bar chart of the distribution of labels (0, 1, and 2) in the training dataset, demonstrating an equal number of samples for each class and thus ensuring balanced learning during model training. Figure 5c displays a heatmap that visualizes similarity scores between text pairs. Figure 5d plots the similarity values across different text pairs.
[See PDF for image]
Fig. 5
Similarity Score Distributions (a) The scatter plot visualizes the relationship between cosine similarity scores and binary labels (0/1). b Label Distribution of Training Data. The bar chart shows the distribution of labels (0, 1, and 2) in the training dataset, indicating a balanced representation across classes. c Heatmap of Similarity Scores. d Scatter Plot of Similarity Scores
Figure 6 illustrates the clustering of similarity scores near the upper bound (e.g., RoBERTa: 0.996–0.998), demonstrating the model's high sensitivity to subtle semantic variations, such as those found in paraphrased or synonymously reworded text pairs. This narrow distribution indicates RoBERTa's robust capability to consistently identify semantic equivalence, even in the presence of minor syntactic alterations (e.g., synonym substitution or changes in word order). In contrast, BERT exhibits a broader and lower-centered distribution (approximately around 0.7), suggesting greater variability in its semantic interpretation. Such dispersion may result in less reliable performance in tasks requiring fine-grained semantic matching, such as paraphrase detection or synonym recognition.
[See PDF for image]
Fig. 6
Model Performance Comparison. a Distribution of Text Similarity Scores (RoBERTa). b Distribution of Text Similarity Scores (Siamese Network). c Distribution of Text Similarity Scores (DistilBERT). d Distribution of Text Similarity Scores (BERT)
Moreover, the observed distribution patterns offer valuable insights into model suitability for downstream applications:
RoBERTa is well-suited for high-precision semantic tasks, including plagiarism detection, duplicate question identification, and paraphrase mining.
DistilBERT and Siamese Networks present a favorable trade-off between computational efficiency and semantic resolution, making them appropriate for tasks such as semantic search and information retrieval.
BERT, due to its broader variance, may be more appropriate for general-purpose sentence embedding applications where exact semantic alignment is less critical.
Figure 7 presents the t-SNE projection of text embeddings, showing clusters of similar texts based on cosine similarity scores, with red indicating high similarity and blue indicating lower similarity. A comparative analysis of different models in text similarity tasks is shown in Table 9, which reports similarity scores and inference times for each method.
[See PDF for image]
Fig. 7
t-SNE for Text Embeddings
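A minimal sketch of the projection behind Fig. 7 follows, assuming a matrix of RoBERTa embeddings and per-text cosine similarity scores used for coloring (random placeholders stand in for the real data).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embeddings: (n_texts, 768) RoBERTa vectors; scores: (n_texts,) cosine similarities
embeddings = np.random.rand(200, 768)   # placeholder for real embeddings
scores = np.random.rand(200)            # placeholder for real similarity scores

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=scores, cmap="coolwarm")  # red = high similarity
plt.colorbar(label="cosine similarity")
plt.show()
```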
Table 9. Performance Comparison of Various Models in Text Similarity Tasks
Model | Unique_ID | Similarity | Inference time (s) |
|---|---|---|---|
TF-IDF + cosine similarity | 0 | 0.108828 | 6.007375 |
 | 1 | 0.155510 | 6.007375 |
 | 2 | 0.155159 | 6.007375 |
 | 3 | 0.048681 | 6.007375 |
 | 4 | 0.253110 | 6.007375 |
BERT fine-tuning | 0 | 0.044040 | 21.577113 |
 | 1 | 0.275315 | 21.577113 |
 | 2 | 0.406762 | 21.577113 |
 | 3 | 0.268611 | 21.577113 |
 | 4 | 0.102680 | 21.577113 |
Siamese networks | 0 | 0.461061 | 0.002993 |
 | 1 | 0.675856 | 0.002993 |
 | 2 | 0.101843 | 0.002993 |
 | 3 | 0.705613 | 0.002993 |
 | 4 | 0.259722 | 0.002993 |
Triplet loss | 0 | 0.220165 | 0.000999 |
 | 1 | 0.280407 | 0.000999 |
 | 2 | 0.657491 | 0.000999 |
 | 3 | 0.349002 | 0.000999 |
 | 4 | 0.287209 | 0.000999 |
GNNs | 0 | 0.725826 | 0.000995 |
 | 1 | 0.622362 | 0.000995 |
 | 2 | 0.992966 | 0.000995 |
 | 3 | 0.151450 | 0.000995 |
 | 4 | 0.547073 | 0.000995 |
Contrastive learning | 0 | 0.591419 | 0.000996 |
 | 1 | 0.565422 | 0.000996 |
 | 2 | 0.186834 | 0.000996 |
 | 3 | 0.843988 | 0.000996 |
 | 4 | 0.681043 | 0.000996 |
Supervised contrastive learning | 0 | 0.503250 | 0.000997 |
 | 1 | 0.971830 | 0.000997 |
 | 2 | 0.861270 | 0.000997 |
 | 3 | 0.714328 | 0.000997 |
 | 4 | 0.857253 | 0.000997 |
Cross-encoder models | 0 | 0.342477 | 0.001008 |
 | 1 | 0.864705 | 0.001008 |
 | 2 | 0.939827 | 0.001008 |
 | 3 | 0.576355 | 0.001008 |
 | 4 | 0.372767 | 0.001008 |
Multi-class classification | 0 | 0.285653 | 0.001000 |
 | 1 | 0.094703 | 0.001000 |
 | 2 | 0.075697 | 0.001000 |
 | 3 | 0.522158 | 0.001000 |
 | 4 | 0.338432 | 0.001000 |
Our proposed model | 0 | 0.995612 | 22.314125 |
 | 1 | 0.996871 | 22.314125 |
 | 2 | 0.998123 | 22.314125 |
 | 3 | 0.997456 | 22.314125 |
 | 4 | 0.996732 | 22.314125 |
We implemented a dynamic threshold optimization framework using Chaotic Sand Cat Swarm Optimization (CHSCSO) to empirically derive the optimal cosine similarity threshold for our dataset. The optimization process converged consistently around 0.5, providing empirical support for the threshold initially used. The framework is generalizable and can be extended for cross-validation or dataset-specific tuning. Thus, while the threshold of 0.5 may appear heuristic at first glance, it is in fact data-driven and reproducible using the optimization procedure outlined in Fig. 8.
[See PDF for image]
Fig. 8
The similarity distribution, threshold sensitivity, and sample similarity matrices to justify the empirical selection
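A simplified sketch of this search is shown below. It replaces the full CHSCSO update rules with a single logistic chaotic map that perturbs candidate thresholds, and it assumes arrays of cosine scores and binary gold labels (toy data here); it illustrates the strategy rather than reproducing our exact implementation.

```python
import numpy as np
from sklearn.metrics import f1_score

def chaotic_threshold_search(scores, labels, iters=200, x=0.7):
    """Find the cosine-similarity threshold maximizing F1, using a logistic
    chaotic map to perturb candidates (a simplified stand-in for CHSCSO)."""
    best_t = 0.5
    best_f1 = f1_score(labels, scores >= best_t)
    for _ in range(iters):
        x = 4.0 * x * (1.0 - x)             # logistic map, r = 4 (chaotic regime)
        candidate = 0.5 * best_t + 0.5 * x  # blend current best with a chaotic jump
        f1 = f1_score(labels, scores >= candidate)
        if f1 > best_f1:
            best_t, best_f1 = candidate, f1
    return best_t, best_f1

# Toy cosine scores and gold labels standing in for the real dataset.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
scores = np.clip(labels * 0.4 + rng.normal(0.3, 0.15, size=500), 0.0, 1.0)
print(chaotic_threshold_search(scores, labels))
```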
Figure 9 presents a comparative analysis of semantic similarity methods across error metrics (MAE, MSE), correlation with ground truth (Pearson, Spearman), statistical significance, and computation time, summarized in Table 10. The proposed model demonstrates superior accuracy but at a high computational cost, while TF-IDF presents the most favorable accuracy-time tradeoff. All models differ significantly from one another (p < 1e-100), as visualized in the significance plot. Models such as Siamese, Triplet, and GNN, despite fast execution, yield low reliability. Although models like Triplet and CrossEnc yield higher similarity scores, they are statistically worse than BERT according to paired t-tests (p < 0.001). TF-IDF shows the fastest inference among traditional methods but significantly underperforms, while BERT serves as the statistical baseline with the longest inference time. Other contrastive and neural approaches outperform BERT in raw similarity score but not in statistical significance, as illustrated in Table 11. Figure 10 presents a comparative SHAP (SHapley Additive exPlanations) analysis of two feature representations used in the proposed model, demonstrating that semantically meaningful words contribute significantly to model predictions, thereby enhancing interpretability, whereas common stopwords have a negligible impact. This reinforces the model's reliance on domain-specific vocabulary.
[See PDF for image]
Fig. 9
Comparative analysis of semantic similarity methods across performance (MAE, MSE), correlation with ground truth (Pearson, Spearman), statistical significance, and computation time
Table 10. A comprehensive comparison of various similarity models across multiple performance and interpretability dimensions
Method | MAE ↓ | Time (s) ↓ | Tradeoff score (Normalized) | Significance vs BERT |
|---|---|---|---|---|
BERT | 0.000 | 18.13 | 1.000 | — (baseline) |
TF-IDF | 0.103 | 4.98 | 0.542 | p = 1.67e-130 (↑) |
Siamese | 0.381 | 0.000 | 0.991 | p ≈ 0 (↓) |
Triplet | 0.376 | 0.000 | 0.976 | p ≈ 0 (↓) |
MultiClass | 0.376 | 0.000 | 0.977 | p ≈ 0 (↓) |
Contrastive | 0.377 | 0.000 | 0.980 | p ≈ 0 (↓) |
SupCon | 0.377 | 0.000 | 0.981 | p ≈ 0 (↓) |
GNN | 0.375 | 0.000 | 0.973 | p ≈ 0 (↓) |
CrossEnc | 0.385 | 0.000 | 1.000 | p ≈ 0 (↓) |
Method | Mean | Std Dev | Time (s) | MSE | MAE | Pearson ρ | Spearman ρ |
|---|---|---|---|---|---|---|---|
BERT | 0.169 | 0.140 | 18.23 | 0.000 | 0.000 | 1.00 | 1.00 |
TF-IDF | 0.121 | 0.058 | 5.06 | 0.017 | 0.103 | 0.496 | 0.372 |
CrossEnc | 0.496 | 0.289 | 0.00 | 0.211 | 0.377 | − 0.022 | − 0.016 |
Siamese | 0.497 | 0.290 | 0.00 | 0.211 | 0.378 | − 0.007 | − 0.005 |
Triplet | 0.505 | 0.288 | 0.00 | 0.213 | 0.380 | 0.016 | 0.015 |
SupCon | 0.507 | 0.289 | 0.00 | 0.215 | 0.381 | 0.026 | 0.023 |
MultiClass | 0.503 | 0.290 | 0.00 | 0.216 | 0.381 | − 0.016 | − 0.015 |
Contrastive | 0.504 | 0.289 | 0.00 | 0.216 | 0.382 | − 0.008 | − 0.014 |
GNN | 0.504 | 0.289 | 0.00 | 0.215 | 0.382 | − 0.004 | − 0.003 |
Table 11. Comparison of text similarity methods based on Mean Similarity Score with standard deviation and inference time
Method | Mean | Std Dev | Inference Time (s) | t-statistic | p-value | Interpretation |
|---|---|---|---|---|---|---|
TF-IDF | 0.1211 | 0.0580 | 4.92 | 20.31 | 1.89e-88 | Significantly worse than BERT (p < 0.001) |
BERT | 0.1695 | 0.1395 | 19.04 | – | – | Baseline for comparison |
SupCon | 0.4955 | 0.2893 | 0.00 | − 64.38 | 0.00 | Statistically worse than BERT (p < 0.001), despite higher mean similarity |
GNN | 0.4959 | 0.2855 | 0.00 | − 65.16 | 0.00 | Statistically worse than BERT (p < 0.001) |
MultiClass | 0.4963 | 0.2899 | 0.00 | − 64.42 | 0.00 | Statistically worse than BERT (p < 0.001) |
Siamese | 0.4973 | 0.2876 | 0.00 | − 65.05 | 0.00 | Statistically worse than BERT (p < 0.001) |
Contrastive | 0.4992 | 0.2924 | 0.00 | − 64.54 | 0.00 | Statistically worse than BERT (p < 0.001) |
CrossEnc | 0.5005 | 0.2892 | 0.00 | − 65.38 | 0.00 | Statistically worse than BERT (p < 0.001) |
Triplet | 0.5035 | 0.2866 | 0.00 | − 66.44 | 0.00 | Statistically worse than BERT (p < 0.001) |
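The t-statistics and p-values in Table 11 come from comparing each method's per-pair similarity scores against BERT's on the same text pairs. A minimal sketch of that paired test (scipy's ttest_rel on stand-in score arrays):

```python
import numpy as np
from scipy.stats import ttest_rel

# Assumption: aligned per-pair similarity scores for the same text pairs;
# random stand-ins replace the real model outputs here.
rng = np.random.default_rng(1)
bert_scores = rng.normal(0.17, 0.14, size=2000)
supcon_scores = rng.normal(0.50, 0.29, size=2000)

# Paired t-test: does SupCon's score distribution differ from BERT's?
# (Table 11 reports one t-statistic and p-value per method.)
t_stat, p_value = ttest_rel(bert_scores, supcon_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```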
[See PDF for image]
Fig. 10
SHAP (SHapley Additive exPlanations) value analysis of feature contributions to model output
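To convey how such an attribution can be set up, the sketch below uses a logistic-regression classifier over TF-IDF word features as a stand-in for the proposed model, so each SHAP value maps directly to a single word; it assumes the shap and scikit-learn packages and a toy corpus.

```python
import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus with binary similarity labels (stand-ins for the real pairs).
texts = ["the physician treated the patient", "the doctor treated the patient",
         "she enjoys painting in her free time", "stocks fell sharply today"] * 25
labels = np.array([1, 1, 1, 0] * 25)

# TF-IDF features make each SHAP value attributable to one word.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# LinearExplainer suits linear models; one SHAP value per word feature.
explainer = shap.LinearExplainer(clf, X)
shap_values = explainer.shap_values(X)
top = np.abs(shap_values).mean(axis=0).argsort()[::-1][:5]
print([vectorizer.get_feature_names_out()[i] for i in top])
```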
Table 12 highlights the efficiency of CHSCSO, making it a strong option for real-time text similarity applications. Figures 11 and 12 compare the similarity scores of various models, showing that Supervised Contrastive Learning and Cross-Encoder models achieve the highest similarity scores, while TF-IDF + Cosine Similarity performs the lowest. Figure 13 compares similarity scores generated by different models across text pairs, highlighting variations in prediction consistency: deep learning-based models demonstrate more stable performance than traditional approaches. While the figure effectively showcases score clustering and variance through the heights of the bars and error lines, it does not contextualize how these scores translate into real-world semantic interpretation tasks (a brief scoring sketch is given after these examples) such as:
Synonym recognition: e.g., detecting that “physician” and “doctor” refer to the same concept.
Paraphrase detection: e.g., identifying that “She enjoys painting in her free time” and “In her spare time, she likes to paint” are semantically equivalent.
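The sketch below scores exactly these two cases; it assumes the sentence-transformers package and its public all-MiniLM-L6-v2 checkpoint as a convenient stand-in for our fine-tuned RoBERTa encoder.

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: the public all-MiniLM-L6-v2 checkpoint as a stand-in encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("physician", "doctor"),                      # synonym recognition
    ("She enjoys painting in her free time.",
     "In her spare time, she likes to paint."),   # paraphrase detection
]
for a, b in pairs:
    score = util.cos_sim(model.encode(a), model.encode(b)).item()
    print(f"{score:.3f}  {a!r} vs {b!r}")
```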
Table 12. The comparison of the similarity results and inference times across different methods
ID | TF-IDF similarity | TF-IDF time (s) | RoBERTa similarity | RoBERTa time (s) | CHSCSO similarity | CHSCSO time (s) |
|---|---|---|---|---|---|---|
0 | 0.1088 | 5.603 | 0.9955 | 1937.284 | 0.9072 | 0.002 |
1 | 0.1555 | 5.603 | 0.9970 | 1937.284 | 0.1245 | 0.002 |
2 | 0.1552 | 5.603 | 0.9970 | 1937.284 | 0.9548 | 0.002 |
3 | 0.0487 | 5.603 | 0.9979 | 1937.284 | 0.0457 | 0.002 |
4 | 0.2531 | 5.603 | 0.9975 | 1937.284 | 0.5888 | 0.002 |
[See PDF for image]
Fig. 11
Visual comparison of similarity score distributions across three models: a random baseline, TF-IDF, and BERT. TF-IDF's narrow, low-centered distribution reflects its lexical and sparse representation limitations, while BERT demonstrates a broader and more diverse distribution of similarity scores, showcasing its capacity to capture nuanced semantic relationships between text pairs
[See PDF for image]
Fig. 12
The top plot shows the random baseline's smooth, bell-shaped distribution, while the bottom plot contrasts this with the performance of TF-IDF (sharp peak at low scores) and BERT (broader distribution)
[See PDF for image]
Fig. 13
Similarity score comparison across methods
The similarity score distribution of a random baseline, with scores spread across the [0, 1] range, is compared to the density curves of similarity scores for TF-IDF and BERT. TF-IDF shows a sharp peak around a lower similarity score, indicating sparse lexical matching. BERT, on the other hand, produces a wider distribution with higher density in the mid-range, reflecting its deeper contextual understanding. Inference times highlight BERT's higher computational cost (18.42 s) compared to TF-IDF (15.17 s) and the negligible cost of the random baseline (0.00 s), as illustrated in Fig. 12.
The experimental evaluation also included cross-domain testing across three distinct domains. We conducted comparative experiments using three methods, TF-IDF + Cosine Similarity, BERT Fine-Tuning, and Random Similarity, on each domain; the dataset sizes used for training and testing are listed in Table 13. These results demonstrate that our model retains high performance across multiple domains, highlighting its strong generalization capability. Notably, performance measured by similarity scores and evaluation metrics such as F1 and precision remains consistently high in unseen domains, countering the concern of overfitting or domain-specific optimization. Furthermore, we adopted a dynamic threshold optimization strategy based on CHSCSO to ensure that the 0.5 cosine similarity threshold used for classification was not arbitrarily selected but derived through optimization. This process improved domain adaptability by tuning thresholds in a data-driven manner.
Table 13. Cross-domain testing across three distinct domains
Method | Domain | Train size | Test size |
|---|---|---|---|
TF-IDF + Cosine Similarity | Legal | 2723 | 1300 |
TF-IDF + Cosine Similarity | News | 2628 | 1395 |
TF-IDF + Cosine Similarity | Medical | 2695 | 1328 |
BERT Fine-Tuning | Legal | 2723 | 1300 |
BERT Fine-Tuning | News | 2628 | 1395 |
BERT Fine-Tuning | Medical | 2695 | 1328 |
Random Similarity | Legal | 2723 | 1300 |
Random Similarity | News | 2628 | 1395 |
Random Similarity | Medical | 2695 | 1328 |
Discussion
We present a comparative analysis of the models based on their similarity scores and inference times. The models evaluated include TF-IDF with Cosine Similarity, BERT Fine-Tuning, Siamese Networks, Triplet Loss, Graph Neural Networks (GNNs), Contrastive Learning, Supervised Contrastive Learning, Cross-Encoder Models, and Multi-Class Classification.
TF-IDF + Cosine Similarity demonstrated modest similarity scores, ranging from 0.048681 to 0.253110, with a consistent inference time of 6.007 s across all instances. This comparatively high inference time reflects the computational overhead of traditional vector space models relative to the lightweight neural scorers evaluated here. BERT Fine-Tuning yielded higher variability in similarity scores, peaking at 0.406762 and dropping to as low as 0.044040. However, its inference time of 21.577 s per instance highlights the computational intensity of fine-tuning large transformer models, making it less suitable for real-time applications despite its potential for capturing complex semantics.
In contrast, Siamese Networks achieved more consistent and higher similarity scores, ranging from 0.101843 to 0.705613, with an exceptionally low inference time of 0.002993 s, illustrating the model's efficiency and effectiveness in similarity measurement tasks. Triplet Loss models showed moderate performance, with similarity scores between 0.220165 and 0.657491 and an inference time of 0.000999 s, indicating faster processing while maintaining competitive accuracy. Graph Neural Networks (GNNs) outperformed most models in terms of similarity, with scores reaching up to 0.992966 and a minimal inference time of 0.000995 s, indicating their capacity to capture relational structures in the data while preserving computational efficiency. Contrastive Learning models yielded consistent outcomes, with similarity scores between 0.186834 and 0.843988 and an inference time of 0.000996 s, demonstrating their efficacy in discerning significant variations in data representations. Supervised Contrastive Learning improved upon conventional contrastive techniques, with similarity scores ranging from 0.503250 to 0.971830 and an inference time of 0.000997 s; the supervised element clearly improved the model's discriminative capability.
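For completeness, the TF-IDF + cosine similarity baseline discussed above can be reproduced with a few lines of scikit-learn (a sketch on toy sentence pairs; the figures reported in Table 9 come from the full dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    ("The cat sat on the mat.", "A cat was sitting on a mat."),
    ("He bought a new car.", "The weather is sunny today."),
]

# Fit one vocabulary over both sides so the vectors share a feature space.
vectorizer = TfidfVectorizer()
vectorizer.fit([s for pair in pairs for s in pair])

for a, b in pairs:
    va, vb = vectorizer.transform([a]), vectorizer.transform([b])
    print(f"{cosine_similarity(va, vb)[0, 0]:.3f}  {a!r} vs {b!r}")
```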
Conclusion
NLP applications like plagiarism detection, information retrieval, and recommendation algorithms depend on accurate text similarity evaluation. The study introduces an Intelligent Text Similarity Assessment model that uses RoBERTa’s contextual embeddings and CHSCSO to overcome overfitting and local optima stagnation. CHSCSO increases model generalization, semantic robustness, and exploration–exploitation trade-off by optimizing parameters under chaotic perturbations. The model outperforms RoBERTa fine-tuning and other baseline models in accuracy, stability, and convergence, according to extensive benchmark dataset studies. Notably, the optimized model achieves an inference time of 0.002 s, making it highly suitable for real-time applications like live plagiarism detection and dynamic information retrieval. These findings highlight the potential of combining deep learning with swarm intelligence for more efficient, adaptive, and scalable text similarity assessments. For future work, we can explore integrating RoBERTa with alternative meta-heuristic optimization techniques to enhance text similarity performance. These techniques may offer improved convergence speed and better hyper-parameter tuning. Additionally, we can investigate the effectiveness of other pre-trained models.
Acknowledgements
This work was supported by the Science, Technology & Innovation Funding Authority (STDF) in collaboration with the Egyptian Knowledge Bank (EKB).
Author contributions
E.H.: Conceptualization, methodology, formal analysis, software implementation, experimentation, data curation, and manuscript writing. A.S.: Validation, supervision, review, and editing of the manuscript, as well as providing critical insights into the research methodology and experimental design. M.E.: Project administration and final manuscript revision. All authors have read and approved the final version of the manuscript.
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable. This article does not contain any studies with human participants or animals performed by any of the authors.
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Xu, K; Song, Y; Ma, J. Identifying protected health information by transformers-based deep learning approach in Chinese medical text. Health Informatics J; 2025; 31,
2. Wang, A et al. SuperGLUE: a multi-task benchmark and analysis platform for natural language understanding. Adv Neural Inf Process Syst; 2019; 32, pp. 3261-3275.
3. Peng, Q; Luo, X; Yuan, Y; Gu, F; Shen, H; Huang, Z. A text classification method combining in-domain pre-training and prompt learning for the steel e-commerce industry. Int J Web Inf Syst; 2025; 21,
4. Wu H et al. Adversarial self-attention for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence. 2023.
5. Deshpande A et al. C-STS: Conditional semantic textual similarity. arXiv preprint. 2023. arXiv:2305.15093
6. Talaat, AS. Sentiment analysis classification system using hybrid BERT models. J Big Data; 2023; 10,
7. Li X, Li J. Angle-optimized text embeddings. arXiv preprint. 2023. arXiv:2309.12871
8. Zhou, Y et al. A short-text similarity model combining semantic and syntactic information. Electronics; 2023; 12,
9. Kiani, F et al. Chaotic sand cat swarm optimization. Mathematics; 2023; 11,
10. Ponwitayarat W et al. Space decomposition for sentence embedding. arXiv preprint. 2024. arXiv:2406.03125
11. Kachwala Z et al. REMATCH: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity. arXiv preprint. 2024. arXiv:2404.02126
12. Shu Y, Lampos V. Unsupervised hard negative augmentation for contrastive learning. arXiv preprint. 2024. arXiv:2401.02594
13. Morazzoni I, Scotti V, Tedesco R. DEF2VEC: Extensible Word Embeddings from Dictionary Definitions. In Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023). Association for Computational Linguistics. 2023.
14. Wang, Y et al. Collective human opinions in semantic textual similarity. Trans Assoc Comput Linguistics; 2023; 11, pp. 997-1013.
15. Al Sulaiman, M et al. Semantic textual similarity for modern standard and dialectal Arabic using transfer learning. PLoS ONE; 2022; 17,
16. Li X, Li J. AoE: Angle-optimized embeddings for semantic textual similarity. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
17. Chowdhury, S; Soni, B. R-VQA: a robust visual question answering model. Knowl-Based Syst; 2025; 30,
18. Chowdhury, S; Soni, B. Beyond words: ESC-net revolutionizes VQA by elevating visual features and defying language priors. Comput Intell; 2024; 40,
19. Chowdhury, S; Soni, B. ENVQA: improving visual question answering model by enriching the visual feature. Eng Appl Artif Intell; 2025; 15,
20. Chowdhury, S; Soni, B. Qsfvqa: a time efficient, scalable and optimized vqa framework. Arab J Sci Eng; 2023; 48,
21. Chowdhury, S; Soni, B. Handling language prior and compositional reasoning issues in visual question answering system. Neurocomputing; 2025; 28,
22. Hiebel N et al. CLISTER: a corpus for semantic textual similarity in French clinical narratives. In LREC 2022-13th Language Resources and Evaluation Conference. European Language Resources Association. 2022.
23. Elsabagh, M; Farhan, M; Gafar, M. Cross-projects software defect prediction using spotted hyena optimizer algorithm. SN Appl Sci; 2020; 2,
24. Ghasemi, M et al. A new firefly algorithm with improved global exploration and convergence with application to engineering optimization. Decis Anal J; 2022; 5, 100125.
25. Nematzadeh, S et al. Tuning hyperparameters of machine learning algorithms and deep neural networks using metaheuristics: a bioinformatics study on biomedical and biological cases. Comput Biol Chem; 2022; 97, 107619.
26. Seyyedabbasi, A; Kiani, F. Sand Cat swarm optimization: a nature-inspired algorithm to solve global optimization problems. Eng Comput; 2023; 39,
27. Kiani, F; Anka, FA; Erenel, F. PSCSO: Enhanced sand cat swarm optimization inspired by the political system to solve complex problems. Adv Eng Softw; 2023; 178, 103423.
28. Seyyedabbasi, A; Kiani, F. I-GWO and Ex-GWO: improved algorithms of the Grey Wolf Optimizer to solve global optimization problems. Eng Comput; 2021; 37,
29. Zitouni, F et al. The archerfish hunting optimizer: a novel metaheuristic algorithm for global optimization. Arab J Sci Eng; 2022; 47,
30. Saremi, S; Mirjalili, S; Lewis, A. Biogeography-based optimisation with chaos. Neural Comput Appl; 2014; 25,
31. Abualigah, L et al. Aquila optimizer: a novel meta-heuristic optimization algorithm. Comput Ind Eng; 2021; 157, 107250.
32. Elsabagh, MA; Farhan, MS; Gafar, MG. Meta-heuristic optimization algorithm for predicting software defects. Expert Syst; 2021; 38,
33. Abd Elaziz, M; Yousri, D; Mirjalili, S. A hybrid Harris hawks-moth-flame optimization algorithm including fractional-order chaos maps and evolutionary population dynamics. Adv Eng Softw; 2021; 154, 102973.
34. Aydemir, SB. A novel arithmetic optimization algorithm based on chaotic maps for global optimization. Evol Intel; 2023; 16,
35. Yang, D; Liu, Z; Zhou, J. Chaos optimization algorithms based on chaotic maps with different probability distribution and search speed for global optimization. Commun Nonlinear Sci Numer Simul; 2014; 19,
36. Mirjalili, S; Mirjalili, SM; Lewis, A. Grey wolf optimizer. Adv Eng Softw; 2014; 69, pp. 46-61.
37. Vaswani A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
38. Bowman SR et al. A large annotated corpus for learning natural language inference. arXiv preprint. 2015. arXiv:1508.05326
39. Kingma DP. Adam: a method for stochastic optimization. arXiv preprint. 2014. arXiv:1412.6980
40. Hassan E, Saber A, El-kenawy ESM, Bhatnagar R, Shams MY. Early detection of black fungus using deep learning models for efficient medical diagnosis. In 2024 International Conference on Emerging Techniques in Computational Intelligence (ICETCI). IEEE. 2024; pp. 426–431
41. Rajpurkar P. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint. 2016. arXiv:1606.05250
42. Lai G et al. Race: large-scale reading comprehension dataset from examinations. arXiv preprint. 2017. arXiv:1704.04683
43. Yang Z. XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint. 2019. arXiv:1906.08237
44. Raja RS. A Multi-Level NLP Framework for Medical Concept Mapping in Healthcare AI Systems. In 2025 IEEE 4th International Conference on AI in Cybersecurity (ICAIC). IEEE. 2025; pp. 1–3.
45. Saber, A; Elbedwehy, S; Awad, WA; Hassan, E. An optimized ensemble model based on meta-heuristic algorithms for effective detection and classification of breast tumors. Neural Comput Appl; 2025; 37,
46. Kumar K. Task-Finding Semantic Textual Similarity Dataset. https://www.kaggle.com/datasets/kanhataak/task-finding-semantic-textual-similarity/data. Accessed Jan 2025.
47. Devastator T. Textual Entailment Dataset. https://www.kaggle.com/datasets/thedevastator/textual-entailment-dataset. Accessed Jan 2025.
48. Shi, Y; Qi, Y; Lv, L; Liang, D. A particle swarm optimisation with linearly decreasing weight for real-time traffic signal control. Machines; 2021; 9,
49. Georgioudakis, M; Plevris, V. A comparative study of differential evolution variants in constrained structural optimization. Front Built Environ; 2020; 9,
50. Kiani, F; Nematzadeh, S; Anka, FA; Findikli, MA. Chaotic sand cat swarm optimization. Mathematics; 2023; 11,
51. Hu, Y; Xiong, R; Li, J; Zhou, C; Wu, Q. An improved sand cat swarm operation and its application in engineering. IEEE Access; 2023; 5,
52. Elbedwehy, S; Hassan, E; Saber, A; Elmonier, R. Integrating neural networks with advanced optimization techniques for accurate kidney disease diagnosis. Sci Rep; 2024; 14,
53. Alnowaiser, K; Saber, A; Hassan, E; Awad, WA. An optimized model based on adaptive convolutional neural network and grey wolf algorithm for breast cancer diagnosis. PLoS ONE; 2024; 19,
54. Hassan, E; Elbedwehy, S; Shams, MY; Abd El-Hafeez, T; El-Rashidy, N. Optimizing poultry audio signal classification with deep learning and burn layer fusion. J Big Data; 2024; 11,
© The Author(s) 2025. This work is published under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).