Abstract
Code reuse through cloning is common in software development, yet excessive or unchecked cloning can harm maintainability and raise plagiarism concerns. Detecting the proportion of reused (cloned) code in a software project, especially across different programming languages, is a challenging task. This paper defines code reuse proportion detection as measuring how much code in a target program is cloned (identical or similar) from elsewhere. Existing code clone detection techniques perform well in single-language settings but struggle with cross-language clones and do not directly quantify reuse proportion. To address these gaps, we propose a novel cross-language function-level code clone detection approach using a dual embedding Siamese neural network. Our method represents code in Java and Python using a unified abstract syntax structure and semantic embeddings, then uses a Siamese deep network to learn language-agnostic similarities. We also introduce a metric to quantify the clone-based reuse ratio for each function or program. Experiments on three public datasets (including a Java clone benchmark, a Python code clone corpus, and a cross-language Java–Python clone dataset) show that our approach outperforms ten baseline methods, including state-of-the-art and classical clone detectors. Ablation studies confirm the contribution of each component (structural embeddings, cross-language alignment, and contrastive learning) to performance gains. Our model achieves new state-of-the-art accuracy in code clone detection, enabling precise measurement of code reuse. These results demonstrate that the proposed approach can effectively detect cross-language code clones and quantify reuse proportion, benefiting software plagiarism detection and code quality assessment in multi-language projects.
Introduction
Software developers often reuse code by copying and adapting existing code fragments, a practice that produces code clones (duplicate or near-duplicate code segments) (Roy and Cordy 2007). Studies have found that clones can constitute between 5%–23% of a software system’s code (White et al. 2016; Koschke 2007), highlighting that code reuse via cloning is pervasive. While reusing code can improve productivity, unmanaged cloning is generally associated with reduced code quality and higher maintenance costs (Juergens et al. 2009). Cloned code fragments may diverge over time, leading to inconsistent bug fixes and increased effort to update all copies when changes are needed (Krinke 2007). In collaborative or multi-language software projects, functionally similar code may appear in different languages, complicating detection of redundancy (Kochhar et al. 2016). Furthermore, in academic settings, cloning manifests as source code plagiarism, where students reuse code from peers or online sources, violating originality expectations. Reliable detection of such cross-language or cross-project code reuse is crucial for academic integrity and open-source license compliance (Prechelt et al. 2002). However, accurately detecting and quantifying code reuse proportion (i.e. the percentage of code that is cloned from elsewhere) remains challenging, especially when clones span different programming languages or undergo substantial modifications.
Early approaches to code clone detection focused on single-language code and often relied on textual or lexical similarity. Classical tools like CCFinder (Kamiya et al. 2002), Deckard (Jiang et al. 2007), NiCad (Roy and Cordy 2008), and SourcererCC (Sajnani et al. 2016) detect Type-I/II/III clones (exact or near-miss clones within the same language) using token matching, abstract syntax trees (ASTs), or software metrics. These methods have proven effective for finding duplicated code blocks in large codebases. For example, CCFinder used token sequences to locate clones, and Deckard compared AST subtrees to find syntactic clones. While such tools achieve high precision on straightforward duplicates, they struggle with semantic clones (Type-IV) where code is functionally similar but lexically different (Zhang and Saber 2025). Moreover, traditional detectors are generally limited to one language or require language-specific analyzers, making cross-language clone detection difficult. Some research on academic plagiarism detection can compare code structure across assignments, but these systems typically compare programs in the same language or rely on simple string matching, failing when a plagiarist translates code logic into another language.
In recent years, the software engineering community has adopted machine learning and deep learning techniques to improve clone detection. Deep learning models can automatically learn code features that capture syntax and semantics beyond simple text similarity. For instance, White et al. (2016) trained deep neural networks on code fragment pairs to detect functional similarity. Other studies have leveraged recurrent neural networks and autoencoders over ASTs or tokens to identify clones (Li et al. 2017; Wei and Li 2017). Attention mechanisms have also been introduced into clone detection models to better align similar code structures (Meng and Liu 2020; Zhao and Huang 2018). More recently, graph-based neural networks have shown success by modeling code’s program dependence graphs or ASTs with control/data flow for clone detection. For example, Wang et al. (2020) proposed using a flow-augmented AST graph and graph neural networks to detect clones, achieving state-of-the-art results on a standard Java clone benchmark. These learning-based approaches significantly improved recall for Type-III/IV clones within a single language. In addition, pre-trained code representation models like CodeBERT (Feng et al. 2020) and GraphCodeBERT (Guo et al. 2020) have been fine-tuned for clone detection tasks, reaching over 94% F1-score on benchmarks of Java code clones (Wang et al. 2021). Despite these advances, the majority of learning-based clone detectors still assume the code pairs are written in the same language (Lei et al. 2022). The clone detection problem becomes more complex in cross-language scenarios, where differences in syntax and libraries can obscure underlying functional similarities.
Only a few works have addressed cross-language code clone detection. Nafi et al. (2019) proposed CLCDSA, a deep learning approach using hand-engineered syntactic features and API documentation mapping to detect clones between Java, C and Python code. Perez and Chiba (2019) introduced a Siamese LSTM model that transforms code ASTs into vectors and compares them to find Java–Python clone pairs. While these pioneering cross-language models demonstrated the feasibility of detecting clones across languages, their performance was limited (F1 around 0.65–0.66), mainly due to information loss in the simplistic feature representations and difficulties in aligning code semantics across languages. Subsequent research has explored intermediate representations or unified embeddings to bridge the language gap. For example, Cheng et al. (2017) developed CLCMiner, which mines software repositories to detect cross-language clone pairs without a common intermediate language. Vislavski et al. (2018) proposed LICCA, using a combination of language-agnostic AST features and identifier mappings for cross-language clones. More recent methods leverage neural representation learning: Tao et al. (2022) introduced C4, a contrastive learning model to map code from different languages into a joint embedding space for clone detection. Fang et al. (2023) presented TCCCD, a triplet-network model for cross-language clones. Despite these efforts, cross-language clone detection remains less accurate than single-language detection, and importantly, prior works have mostly focused on identifying whether two code fragments are clones, not on quantifying how much of a given program is composed of reused code. In contexts like plagiarism detection or open-source license compliance, knowing the proportion of external code reuse (e.g., “Program X contains 30% code cloned from known sources”) is as crucial as detection itself. Existing clone detectors do not directly output such a measure; examiners must manually inspect clone reports to estimate reuse percentage. This motivates a dedicated approach for code reuse proportion detection that integrates clone detection with a quantification step.
In this paper, we present a novel method to detect function-level code clones across languages (specifically Java and Python) and to quantify the reuse proportion of code. We design a deep learning framework that combines the strengths of structural code analysis and representation learning to overcome the challenges identified above. The main contributions of our work are summarized as follows:
We propose a new Siamese neural network architecture for cross-language code clone detection. Our model accepts code in different languages, generates a unified embedding for each function, and compares embeddings using a Siamese network with a contrastive loss. Unlike prior cross-language approaches that depend on handcrafted mappings or language-specific analyzers, our framework automatically learns language-agnostic representations from data through a unified AST-based neural encoder. This design achieves higher semantic alignment between languages, enabling accurate detection of Java–Python clones even with divergent syntax.
We introduce a language-independent code representation technique that abstracts code into an intermediate form capturing both syntactic structure and semantic cues. By extending traditional ASTs with semantic annotations (such as normalized operators and API usage), we unify code from different languages into a common structural space. This unified representation bridges the lexical and structural gap between languages, going beyond earlier normalization or token-based methods that lack deep semantic abstraction.
We define a novel metric and procedure to quantify the proportion of reused (cloned) code in a given program or codebase. Building on our clone detection model, we calculate the reuse ratio for each function as the fraction of its code tokens identified as clones from a reference repository. To our knowledge, this is the first work to integrate clone detection and quantitative reuse measurement within a unified framework, transforming qualitative clone identification into a measurable software reuse index applicable to plagiarism detection and code quality analysis.
We conduct extensive experiments on three public datasets to validate our approach, including a classical Java clone benchmark, a Python clone dataset from online judge problems, and a cross-language Java–Python clone set derived from XLCoST. Compared with ten state-of-the-art baselines, our model consistently achieves superior accuracy and F1 scores, setting a new benchmark for cross-language clone detection. The broad and systematic evaluation, together with detailed ablation studies, provides strong empirical evidence for the novelty and effectiveness of our contributions.
Related work
Code clone detection techniques
Code clone detection has been an active research area for over two decades, and numerous techniques have been developed (Roy and Cordy 2008; Rattan et al. 2013). Early approaches can be categorized by the code representation they compare: textual, lexical, syntactic, or semantic. Text-based clone detectors operate on raw source code strings, identifying clones by exact or parameterized string matching. A classic example is Baker’s duplication detection using suffix trees (Baker 1995), which could rapidly find exact duplicates in code. Such text-based methods (including simple line-based tools) can catch Type-I clones (identical code) but are brittle to renaming or formatting changes (Zhu et al. 2022).
Lexical approaches tokenize the source code and detect clones by comparing token sequences. Tools like CCFinder (Kamiya et al. 2002) and PMD (Copy/Paste Detector) fall in this category. CCFinder transforms code into a sequence of uniform tokens and identifies similar token subsequences, allowing detection of Type-II clones (which differ only in identifiers or literals). These approaches improve robustness to minor edits, but still may not handle reordered code or inserted statements well. Some token-based techniques use hash signatures (fingerprints) of code substrings to find matches efficiently in large codebases (Rabin 1981). Metric-based methods compute numerical characteristics (e.g., number of operators, operands, structural metrics) for code fragments and consider fragments with similar metric vectors as clones (Cordy and Roy 2011). While fast, metric methods can miss clones with similar functionality but dissimilar metric profiles.
Syntax-based approaches leverage the source code’s abstract syntax tree. A representative tool is Deckard (Jiang et al. 2007), which parses code into ASTs and then uses a tree hashing and clustering technique to identify similar subtrees (code fragments). AST-based clone detection can detect Type-III clones, including those with added/deleted statements, since the tree structure can still show significant overlap. NiCad (Roy and Cordy 2008) is another syntax-based tool that extracts pretty-printed code fragments (functions) and uses flexible pretty-print normalization plus longest common subsequence (LCS) matching to find clones with consistent renaming. AST-based methods are generally more robust to formatting and minor edits, but they must handle the computational complexity of tree matching. Advancements like syntactic fingerprinting (Chilowicz et al. 2009) condense ASTs into characteristic strings to enable faster matching. Some approaches also use program dependency graphs (PDG) to detect semantic similarity beyond syntax, comparing data-flow and control-flow structures (Pham et al. 2009). These can find clones that are logically similar even if syntactically different, but comparing graphs is expensive and sensitive to program input differences.
Semantic clone detection aims at identifying code that performs the same functionality, even if implemented with different algorithms or idioms (Type-IV clones). Traditional static analysis struggles here, so researchers have explored dynamic analysis and symbolic execution to detect semantic clones (e.g., checking if two code fragments produce the same outputs for various inputs) (Svajlenko and Roy 2015). However, dynamic semantic approaches face scalability issues and are not widely used in practice.
Machine learning and deep learning for clone detection
The application of machine learning has significantly advanced clone detection capabilities. Unsupervised learning techniques were initially applied, for example clustering code fragments based on feature vectors to suggest groups of similar code (Cordy and Roy 2011). In the last decade, deep learning models have become prominent. These models learn vector representations (embeddings) of code that capture syntax and semantics, enabling more effective clone identification through similarity in the learned vector space.
One pioneering work by White et al. (2016) used an autoencoder to learn features of code fragments and a deep neural network (DNN) to classify if two code fragments are clones. They represented code as tokens and trained on known clone pairs, achieving good accuracy on simple clone benchmarks. Following this, researchers began exploring neural models over structural representations of code. AST-based neural models: Zhang et al. (2019) proposed ASTNN, which splits functions into statement-level AST subtrees and uses a recursive neural network to encode each subtree, followed by a sequential model (RNN) to capture the flow of the program (Zhang et al. 2019). ASTNN showed strong results on detecting both syntactic and semantic clones in Java programs, thanks to its ability to model structural context. Wei and Li (2017) introduced a Supervised Deep Features model (often called CDLH) that combines lexical and syntactic information in a hybrid deep network to detect functional clones, reporting improved recall for Type-IV clones (Wei and Li 2017).
Tree and graph neural networks: With the development of graph neural networks (GNNs), code clone detection has benefited from richer graph-based representations. Wang et al. (2020) designed a GNN operating on a flow-augmented AST, which incorporates control flow and data flow edges into the AST graph, and then used a graph matching network to compare two code graphs. Their approach significantly improved clone detection for cases with non-contiguous similar code. Similarly, Zhao and Huang (2018) developed DeepSim, which uses a combination of AST-based LSTM and a similarity matching network to detect functional similarity in code (Zhao and Huang 2018). There are also approaches using convolutional neural networks (CNNs) on code representations: for example, the TBCNN by Mou et al. (2016) performs a convolution over the AST structure to encode a program, which was originally used for code classification but is applicable to clone detection. Recent work by Wan et al. (2025) even converts code ASTs into semantic images and applies CNN image similarity to detect semantic clones with high accuracy. These deep learning methods collectively have pushed clone detection performance higher; many report Precision/Recall over 90% on benchmarks like BigCloneBench.
Another noteworthy trend is the use of pre-trained code models. Models such as CodeBERT (Feng et al. 2020), GraphCodeBERT (Guo et al. 2020), CodeT5 (Wang et al. 2021), and PLBART have been trained on massive code corpora to learn general code representations. Fine-tuning these models for clone detection has yielded excellent results, often outperforming specialized models on standard datasets. For instance, fine-tuned CodeBERT achieved an F1 of 0.95 on BigCloneBench clones, learning generalizable patterns of similarity (Svyatkovskiy et al. 2020). Interpretability of such deep models is an ongoing challenge; some recent studies attempt to interpret what code features CodeBERT learns for clone detection (Abid et al. 2023). Nonetheless, the consensus in recent literature is that deep learning approaches, especially those combining structural code analysis with neural embeddings, represent the state-of-the-art for monolingual clone detection (Lei et al. 2022).
Cross-language clone detection
Detecting clones across programming languages (also known as cross-lingual clone detection) is inherently difficult due to differences in syntax, semantics, and libraries between languages. A straightforward approach might translate one language to another or to a common intermediate representation (IR) and then apply monolingual clone detection, but accurate automatic code translation is itself non-trivial. Some early approaches tackled specific language pairs by leveraging similarities (e.g., C vs. Java clones, which share similar syntax and libraries) (Keivanloo et al. 2011).
More systematic research into cross-language clones has emerged in the last few years. CLCMiner (Cheng et al. 2017) avoided the need for a unified IR by mining version histories: if a project had implementations in two languages (e.g., a C++ version and a Java version of the same algorithm), changes committed with the same message in both codebases could indicate corresponding clone fragments. This approach, however, depends on multi-language projects with parallel histories and cannot find clones in unrelated projects. LICCA (Vislavski et al. 2018) tried to generalize clone features across languages by using language-agnostic abstraction (e.g., normalizing identifiers and literals, focusing on structural metrics common to both languages). It reported moderate success in Java–C clone detection.
Table 1. Distinguishing the proposed method from prior cross-language clone detection approaches (a)
Method | Input Representation | Learning Architecture | Output |
|---|---|---|---|
CLCDSA | Hand-engineered AST & API features | Siamese DNN | Clone / Non-clone |
TCCCD | Token & semantic embeddings | Triplet Network | Clone similarity |
Ours | Unified AST with semantic annotations | Contrastive Siamese + Attention | Clone detection + Reuse proportion |
Modern cross-language clone detectors employ machine learning to overcome the syntactic disparities. Nafi et al. (2019) – CLCDSA: This method extracts syntactic features (like number of nodes of each AST type) and uses neural networks in a Siamese setup, combined with API documentation alignment between languages. By mapping similar API calls (e.g., Java java.util.List.add <-> Python list.append) and comparing feature vectors, CLCDSA could detect some cross-language clones, but it was limited by the completeness of the manual API mappings and feature design. Perez and Chiba (2019): They introduced perhaps the first neural cross-language clone model using end-to-end learning. They trained a Siamese bidirectional LSTM to encode Java and Python ASTs (transformed into sequences) and used a hashing layer to reduce dimensionality, followed by a feed-forward comparator. Their model achieved around 66% F1 on a Java-Python clone dataset, indicating the difficulty of the task. A key insight was that using identical network weights for both languages helped, but the simple sequential AST encoding missed deeper semantic similarities. Yahya and Kim (2022) improved on this with CLCD-I, which uses InferCode (a self-supervised AST encoder Bui et al. 2021) to produce embeddings for code in each language, then a Siamese DNN to classify clone pairs. CLCD-I reached 0.78 F1 (with high precision but moderate recall) for Java-Python clone, outperforming earlier works. This indicates that pre-training on unlabeled code (InferCode) provided a better initial representation for each language, and the Siamese architecture learned to align those representations.
Recent research has taken advantage of large language models and advanced contrastive learning. C4 (Tao et al. 2022) employed contrastive learning to explicitly bring representations of known clone pairs closer while pushing non-clones apart. TCCCD (Fang et al. 2023) used a triplet loss (anchor–positive clone–negative nonclone) to improve cross-language discrimination. These models report F1 improvements into the 0.70–0.85 range for cross-language tasks. Additionally, Large Language Models (LLMs) like ChatGPT/GPT-3 have been evaluated on clone detection: they can sometimes identify if two code snippets are similar when prompted, especially for simple algorithms. However, a recent study found that LLMs struggle with more complex code and often do not truly understand the code’s semantics in a multi-language context, leading to inconsistent performance (Moumoula et al. 2024). It was shown that fine-tuned embedding models still outperform LLMs on challenging cross-language clone benchmarks by a significant margin.
In summary, cross-language clone detection has progressed but lags behind single-language clone detection in accuracy. No existing approach perfectly aligns code semantics across arbitrary languages; most work either focuses on closely related languages or uses extensive hand-crafted mappings. Moreover, none of the above methods explicitly computes the proportion of code that is cloned – they simply classify pairs or retrieve similar code. Our work builds on insights from these related studies: we use a Siamese neural architecture like Nafi et al. (2019); Perez and Chiba (2019), but we enhance it with a unified structural code embedding (inspired by Bui et al. 2021) and a contrastive training objective (similar to Tao et al. 2022). By doing so, we aim to push cross-language clone detection performance closer to parity with single-language detection. Additionally, by aggregating clone detection outputs, we introduce a novel capability to quantify code reuse extent, extending clone detection research into the quantitative domain of software reuse analysis. This fills an important gap, as identified in both the software engineering and academic integrity communities, for tools that not only detect plagiarism or cloning but also measure it in meaningful terms (e.g., “X% of the code is cloned”) for decision making.
Fig. 1 Overview of cross-language clone detection and reuse proportion
To better highlight the novelty of our proposed reuse quantification method, we summarize in Tables 1 and 2 a comparison between our approach and representative cross-language clone detection methods, including CLCDSA (Nafi et al. 2019) and TCCCD (Fang et al. 2023). The table contrasts their main characteristics in terms of input representation, language coverage, learning architecture, and output capability.
Table 2. Distinguishing the proposed method from prior cross-language clone detection approaches (b)
Method | Language Coverage | Reuse Quantification |
|---|---|---|
CLCDSA | Java, C#, Python | ✗ |
TCCCD | Multi-language | ✗ |
Ours | Java–Python | ✓ |
Methodology
To facilitate understanding, we first provide a conceptual overview of the proposed approach before presenting the technical details in the following subsections. Our proposed approach consists of two main components: (1) a cross-language code embedding and clone detection model that identifies whether two given code fragments (functions) are clones, and (2) a reuse proportion computation that aggregates clone detection results to estimate how much of a program is reused. The overall framework is illustrated in Fig. 1, which shows how code in different languages is processed into a unified representation and compared by a Siamese network to detect clones. In the following subsections, we detail the code representation, the Siamese neural network architecture and training procedure, the algorithm for determining code reuse proportion, and the complexity and reproducibility aspects of our method.
Unified code representation and embedding
To compare code across languages effectively, we first transform each code fragment (function) into a language-agnostic intermediate representation. We chose to build on the abstract syntax tree (AST) because it preserves structural and syntactic information while abstracting away superficial differences like formatting. However, a vanilla AST still contains language-specific node types and idioms. We therefore designed a unified abstract syntax representation (UASR) that normalizes elements common to most imperative languages. The UASR is essentially an enhanced AST with additional normalization:
Normalized Identifiers and Literals: All user-defined identifiers (variable names, function names) are normalized to a generic token (e.g., VAR, FUNC) with subscripts for consistency within a fragment. For example, sum = 0 in Python and int total = 0 in Java would both yield an assignment node with a generic variable name. Literal values (numbers, strings) are similarly replaced with generic tokens (NUM, STR, etc.), optionally tagged by type. This removes language-specific naming without losing structure.
Unified AST Node Types: We map language-specific AST node types to a common set. For instance, a Python For loop and a Java enhanced for loop are both mapped to a generic ForLoop node type with comparable child structures (initialization, condition, iteration, body) even though Python’s AST represents it differently. Similarly, a method/function definition in any language becomes a FunctionDef node with children for parameters, body, etc. Constructs that have no clear counterpart in the other language (e.g., Python’s list comprehension or Java’s do-while loop) are still included but marked with language tags so that the model can learn to handle them. We have defined a mapping for the syntactic constructs that Java and Python share (which cover the majority of typical code logic: expressions, loops, conditionals, function calls, etc.). For better clarity, we illustrate how the unified AST representation works across languages using a simple example; a concrete sketch of this example is given after this list. Consider the following pair of functions that compute the factorial of a number: one written in Java and one in Python. In the original ASTs, Java represents the loop construct as a “ForStatement” node with initialization, condition, and update subnodes, while Python represents it as a “For” node with an iterable and a body. Under our unified AST mapping, both are normalized into a generic “ForLoop” node with child nodes Init, Condition/Iterable, Body. Similarly, variable declarations such as int result = 1; in Java and result = 1 in Python are both mapped into an “Assign(VAR, NUM)” structure. This example demonstrates how different syntactic structures are harmonized into a single abstract representation, enabling the model to treat functionally equivalent code consistently across languages.
Semantic Annotations: We augment the AST with semantic nodes to capture important information like API calls and operations. For example, we introduce a generic Call node whose children include the normalized function name and arguments, regardless of whether it is a Java method invocation or a Python function call. If certain API calls are known equivalents (like Java’s System.out.println and Python’s print), we add an annotation linking those. We also represent infix operations (e.g., arithmetic, comparisons) with generic operator nodes (like Op_ADD for + in any language). These annotations help the model learn correspondences such as both code fragments using a sort function or both performing a similar arithmetic operation, even if the syntax differs.
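To make the factorial example concrete, the sketch below renders both versions into one unified structure. The nested-tuple encoding and the exact node labels (ForLoop, Assign, VAR_k, NUM, Op_MUL) are an illustrative approximation of the UASR described above, not the literal output of our generator.

```python
# Python version of the factorial example; the Java version,
#   int result = 1; for (int i = 2; i <= n; i++) { result *= i; } return result;
# is normalized to the same unified structure shown below.
def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

# Illustrative unified AST (UASR): identifiers -> VAR_k, literals -> NUM,
# and both loop constructs -> a generic ForLoop node.
uasr_factorial = (
    "FunctionDef", [
        ("Params", [("VAR_0", [])]),                                         # n
        ("Body", [
            ("Assign", [("VAR_1", []), ("NUM", [])]),                        # result = 1
            ("ForLoop", [
                ("Init", [("Assign", [("VAR_2", []), ("NUM", [])])]),        # i = 2
                ("Condition/Iterable", [("Op_LE", [("VAR_2", []), ("VAR_0", [])])]),  # i <= n
                ("Body", [("Assign", [("VAR_1", []),
                                      ("Op_MUL", [("VAR_1", []), ("VAR_2", [])])])]), # result *= i
            ]),
            ("Return", [("VAR_1", [])]),
        ]),
    ],
)
```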
Given the unified representation, each function is encoded bottom-up over its AST: every leaf node receives a learned embedding of its normalized type and token, and the vector of an internal node n is computed from the vectors of its children as

$$v_n = \sigma\left(W \left[v_{c_1} \oplus v_{c_2} \oplus \cdots \oplus v_{c_k}\right] + b\right) \tag{1}$$

where $c_1, \ldots, c_k$ are the children of node $n$ (ordered as in the AST), $\oplus$ denotes vector concatenation, $W$ is a learned weight matrix, $b$ a bias, and $\sigma$ is a non-linear activation (we use ReLU). This is a simple one-layer feed-forward aggregation; in practice, we found it sufficient when used in combination with an attention mechanism (described below). For nodes with a variable number of children, the concatenation ensures a positional encoding of each child. Equation (1) effectively performs a tree convolution over the children's vectors, compressing their information into the parent vector.

To make the representation focus on important parts of the code, we integrate an attention mechanism over the AST nodes. Following the idea of weighted AST aggregation, we compute an attention weight $\alpha_i$ for each child node $c_i$ of $n$ as:

$$\alpha_i = \frac{\exp\left(u^{\top} v_{c_i}\right)}{\sum_{j=1}^{k} \exp\left(u^{\top} v_{c_j}\right)} \tag{2}$$

Here $u$ is a learned global context vector (of the same dimension as the node vectors) that indicates which features are generally important. Equation (2) is a softmax assigning higher weight to child nodes whose vectors align more closely with $u$. We then take a weighted sum of child vectors, rather than a simple concatenation or average, when computing parent vectors:

$$v_n = \sigma\left(W' \sum_{i=1}^{k} \alpha_i v_{c_i} + b'\right) \tag{3}$$

where $W'$ and $b'$ are new learned parameters. This attention pooling (3) means that subtrees likely to be semantically significant (for example, the loop body rather than a loop counter) have more influence on the parent representation. Using attention also makes the model lighter and potentially requires less training data, as it can learn to emphasize key parts of the code (e.g., function calls, computations) that signal similarity.

After processing the entire tree, we obtain an embedding for the root node (function), which represents the whole code fragment. Let $E(f)$ denote the embedding of function $f$ produced by the above tree encoder (with attention). These embeddings lie in $\mathbb{R}^d$ (we use $d = 256$ in our implementation). Importantly, the encoder has the same parameters regardless of the programming language of the input code, thanks to the unified node representations. We do not train separate encoders per language; the same model processes Java and Python ASTs. This design choice ensures that the embeddings of similar code from different languages will naturally align in the vector space, to the extent that the unified AST captures their similarity. We also leverage self-supervised pre-training for this encoder: prior to training the clone detector, we pre-train the encoder on unlabeled code from both languages using the InferCode objective of predicting subtrees. This helps initialize $E(f)$ to produce meaningful vectors even before clone-pair supervision.
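As a minimal PyTorch sketch of how Equations (2) and (3) could look in code, the snippet below implements the attention-based child aggregation. The class name, the recursive encode helper, and the embed_leaf callback are illustrative assumptions; details such as Equation (1), the leaf embedding table, and InferCode pre-training are omitted.

```python
import torch
import torch.nn as nn

class AttentionTreeEncoder(nn.Module):
    """Sketch of the attention-based child aggregation of Eqs. (2) and (3)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim))   # global context vector u
        self.proj = nn.Linear(dim, dim)           # W', b'
        self.act = nn.ReLU()

    def combine_children(self, child_vecs: torch.Tensor) -> torch.Tensor:
        # child_vecs: (num_children, dim) -- vectors of the children of node n
        scores = child_vecs @ self.u                            # u^T v_ci
        alpha = torch.softmax(scores, dim=0)                    # Eq. (2)
        pooled = (alpha.unsqueeze(1) * child_vecs).sum(dim=0)   # weighted sum of children
        return self.act(self.proj(pooled))                      # Eq. (3): parent vector v_n

    def encode(self, node, embed_leaf) -> torch.Tensor:
        # node: (label, children); embed_leaf maps a leaf label to a vector
        label, children = node
        if not children:
            return embed_leaf(label)
        child_vecs = torch.stack([self.encode(c, embed_leaf) for c in children])
        return self.combine_children(child_vecs)
```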
Siamese network architecture
With a way to convert any function f (Java or Python) into an embedding E(f), we now describe the Siamese neural network that determines if two code fragments are clones. A Siamese network consists of two identical subnetworks that process two inputs in parallel, followed by a comparison layer that computes a similarity or distance. In our model, the subnetwork is essentially the code encoder described above (including the AST embedding and attention). We feed two code fragments $f_1$ and $f_2$ (possibly from different languages) into the Siamese twin encoders, which share all weights. This yields embeddings $e_1 = E(f_1)$ and $e_2 = E(f_2)$. Sharing the encoder weights ensures that the vector representations live in the same feature space and that similar code patterns produce similar vectors regardless of language. This is crucial; if we allowed separate embeddings per language, the model might learn disjoint representations that are not directly comparable. By using one unified encoder, we enforce that, say, a loop or sort algorithm looks alike in the embedding space whether it came from Java or Python.
Next, we need to compare $e_1$ and $e_2$ to judge if the code fragments are clones. A straightforward approach is to compute a distance (such as Euclidean distance or cosine similarity) between the two vectors and threshold it. In our architecture, we opt for a learned comparator. We construct a feature vector from $e_1$ and $e_2$ as $h = [\,|e_1 - e_2|\,;\; e_1 \odot e_2\,]$, where $|e_1 - e_2|$ is the element-wise absolute difference and $e_1 \odot e_2$ is the element-wise (Hadamard) product. These are common feature transformations in Siamese setups to capture both distance and commonality between vectors. The combined vector h is fed into a small feed-forward network (FFN) with one hidden layer to produce an output score s. The FFN essentially learns to weigh the differences and similarities across dimensions. Finally, s is passed through a sigmoid activation to produce a probability that the pair is a clone (1 for clone, 0 for non-clone).
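A minimal PyTorch sketch of this learned comparator, assuming embeddings of dimension d = 256 as in our implementation; the hidden size of 128 and the module name are illustrative choices.

```python
import torch
import torch.nn as nn

class CloneComparator(nn.Module):
    """Sketch: |e1 - e2| and e1 * e2 features fed to a small FFN with a sigmoid output."""

    def __init__(self, dim: int = 256, hidden: int = 128):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
        h = torch.cat([(e1 - e2).abs(), e1 * e2], dim=-1)   # distance + commonality features
        return torch.sigmoid(self.ffn(h)).squeeze(-1)        # probability that the pair is a clone
```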
We train the Siamese network using a contrastive loss function on labeled pairs. For a pair of functions $(f_1, f_2)$ with label $y = 1$ (clones) or $y = 0$ (not clones), let $D = \lVert E(f_1) - E(f_2) \rVert_2$ be the Euclidean distance between their embeddings. The contrastive loss $L$ is:

$$L = y \, D^2 + (1 - y) \, \max(0,\, m - D)^2 \tag{4}$$

where $m$ is a margin parameter. Intuitively, if $y = 1$ (they are clones), the loss is $D^2$, pushing the network to make $D$ small (embeddings close) for clone pairs. If $y = 0$ (non-clone), the loss is $\max(0, m - D)^2$. This term is zero as long as $D$ exceeds the margin $m$, meaning that if the embeddings are far apart by at least $m$, we incur no loss (the network has done enough to separate them). If $D$ is within $m$ for a non-clone pair, there is a penalty pushing them apart. We fix the margin empirically in our experiments; since embedding vectors are normalized to unit length, Euclidean distances lie in [0, 2]. The total training loss is the sum of $L$ over all training pairs. We found that using contrastive loss yields more stable training than a binary cross-entropy on the output, because it directly operates on embedding distances. It effectively creates a representation space where clones cluster together and non-clones are well separated by at least margin $m$.

To further improve learning, we generate hard negative pairs during training. Besides true clone pairs from the ground truth, we include pairs of functions that are superficially similar but not true clones as negative examples. For instance, two functions with the same name but different behavior, or two solutions to different tasks that happen to share some code patterns. We mine such pairs by random mixing and by using our current model’s confusion: at each epoch, we identify some non-clone pairs that the model erroneously scores as highly similar and include them (with $y = 0$) in the next epoch’s training batch. This hard-negative mining forces the model to refine the decision boundary between actual clones and coincidentally similar code.
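A hedged PyTorch sketch of Equation (4); the default margin of 1.0 is only a placeholder, since the margin is fixed empirically for unit-normalized embeddings.

```python
import torch

def contrastive_loss(e1: torch.Tensor, e2: torch.Tensor,
                     y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Eq. (4): y = 1 for clone pairs, y = 0 for non-clones.

    The margin value here is a placeholder; for unit-normalized embeddings the
    Euclidean distances lie in [0, 2], and the actual margin is tuned empirically.
    """
    d = torch.norm(e1 - e2, p=2, dim=-1)                      # Euclidean distance D
    pos = y * d.pow(2)                                        # pull clone pairs together
    neg = (1 - y) * torch.clamp(margin - d, min=0).pow(2)     # push non-clones beyond the margin
    return (pos + neg).mean()
```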
After training, the Siamese network outputs a similarity measure. At inference, given any two code fragments $f_1$ and $f_2$, we compute the embedding distance $D = \lVert E(f_1) - E(f_2) \rVert_2$ and classify them as clones if $D < T$, where $T$ is a chosen threshold. We set $T$ based on a validation set to balance precision and recall (for example, $T = 0.5$ gave a good F1 balance in our experiments). This yields a binary decision of “clone” or “not clone” for the pair. We can also interpret the sigmoid output $s$ of the network as the confidence that they are clones.
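A one-line illustration of this decision rule under the same assumptions (unit-normalized embeddings, validation-tuned distance threshold T = 0.5):

```python
import torch

def is_clone(e1: torch.Tensor, e2: torch.Tensor, T: float = 0.5) -> bool:
    """Pair decision rule: clone iff the Euclidean distance between the embeddings is below T."""
    return torch.norm(e1 - e2, p=2).item() < T
```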
Code reuse proportion detection algorithm
Beyond pairwise clone detection, our goal is to compute the proportion of code in a subject program that is reused (cloned) from elsewhere. We define the reuse proportion for a code fragment (or entire program) as the fraction of its code tokens that can be mapped to one or more cloned segments originating outside the fragment. In simpler terms, if we highlight all parts of the code that are detected as clones from an external source or library, what percentage of the code is highlighted? This metric is crucial in plagiarism detection (to see how much of a submission is copied) and in code quality analysis (to identify overly copied projects).
To compute this, we integrate our clone detector in an algorithmic process. Assume we have a target program P (which could be one file or a collection of functions) for which we want the reuse proportion, and a reference repository R of source code that represents “potential sources” of clones (e.g., a database of known code, or other projects, or in case of plagiarism, other student submissions or open-source code). If the goal is to detect self-contained reuse within a project (finding duplicates inside the project), R can be the same as P (intra-project clones). For plagiarism detection, R would be other students’ code or known solutions. Our approach is agnostic to this choice.
We break down the detection into function-level units. Let $F_P = \{f_1, \ldots, f_n\}$ be the set of functions in program P (extracted via parsing). Let $F_R$ be the set of functions in the repository R (which could be very large, but note that our method can embed and index them efficiently). We proceed as follows (pseudocode given in Algorithm 1):
Embed all functions: Compute $E(g)$ for every function $g$ in $F_R$ using our unified encoder. This results in a database of embeddings for the repository functions. We organize these in an index (for example, using a KD-tree or FAISS index for efficient nearest-neighbor search in the vector space).
Clone search for each function: For each function $f_i$ in the target program P, find its closest matches in R. We perform a k-nearest-neighbors search in the embedding space: retrieve the top k repository functions whose embeddings are nearest to $E(f_i)$. We then apply our clone classifier (distance threshold T) to these candidates to determine if any of them are clones of $f_i$. If yes, record the clone with the highest similarity. If multiple distinct matches exist covering different parts of $f_i$, we can consider them separately, but typically we assume one primary source if it exists. If no repository function passes the clone threshold for $f_i$, we consider $f_i$ as original (not reused).
Reuse length calculation: If $f_i$ is deemed a clone of some $g \in F_R$, we estimate the extent of cloning within $f_i$. One simple way is to use the token sequences of $f_i$ and $g$: we can perform a longest common subsequence (LCS) computation, or collect common token subsequences, to approximate which parts of $f_i$ align with $g$. Another approach is to split $f_i$ into basic blocks or lines and mark those that have high similarity to segments in $g$ (e.g., via local sequence alignment). For simplicity, we define $\mathit{reused}(f_i)$ as the number of tokens in $f_i$ that appear in the cloned segments (matched against $g$ after normalization). This can be obtained by an LCS-based diff between the two code fragments. Let $|f_i|$ be the total number of tokens in $f_i$. We then compute the reuse ratio for function $f_i$ as $r_i = \mathit{reused}(f_i) / |f_i|$. If multiple clone sources are detected for different portions of $f_i$, $\mathit{reused}(f_i)$ can include all those portions (ensuring we do not double-count overlapping regions). For a function with no detected clone, $r_i = 0$.
Aggregate to program-level proportion: The reuse proportion for the whole program P can be aggregated in various ways. One simple way is a weighted average by code length: $R_P = \sum_i \mathit{reused}(f_i) \,/\, \sum_i |f_i|$, i.e., total cloned tokens over total tokens in the program. This gives a single percentage of code that is reused. Alternatively, if higher-level structure (like classes or files) is needed, we can compute per-file ratios similarly. In our experiments, we report $R_P$ as a percentage.
Algorithm 1 Code Reuse Proportion Detection
In line 13, ComputeAlignment(f_i, g) refers to an algorithm that identifies which tokens of $f_i$ align with tokens of $g$ (the clone source). This could be an LCS-based diff as described. In our implementation, we used a dynamic programming LCS on the normalized token sequences of $f_i$ and $g$ to mark the matching regions and count tokens.
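For concreteness, the following Python sketch mirrors the procedure described above (embedding lookup, k-NN clone search, alignment, aggregation). The helper names (embed, tokenize, repo_index.search) and the use of difflib as an LCS-style aligner are illustrative assumptions, not the exact API of our released implementation.

```python
from difflib import SequenceMatcher

def lcs_token_count(tokens_a, tokens_b):
    """Approximate LCS-style alignment: tokens of `tokens_a` covered by matches with `tokens_b`."""
    sm = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return sum(block.size for block in sm.get_matching_blocks())

def reuse_proportion(target_funcs, repo_index, embed, tokenize, k=3, threshold=0.5):
    """Return the program-level reuse ratio R_P and per-function ratios r_i.

    `repo_index.search(vec, k)` is assumed to return (repository function, distance)
    pairs for the k nearest embeddings, e.g., backed by a FAISS index.
    """
    reused_total, tokens_total, per_function = 0, 0, {}
    for f in target_funcs:
        tokens_f = tokenize(f)
        tokens_total += len(tokens_f)
        candidates = repo_index.search(embed(f), k)
        clones = [g for g, dist in candidates if dist < threshold]   # clone classifier (threshold T)
        if clones:
            # assume one primary source: keep the best-aligned candidate
            best = max(lcs_token_count(tokens_f, tokenize(g)) for g in clones)
            reused_total += best
            per_function[f] = best / len(tokens_f)                   # r_i
        else:
            per_function[f] = 0.0                                    # original (no clone found)
    return reused_total / max(tokens_total, 1), per_function
```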
The reuse proportion $R_P$ is a real number between 0 and 1 (we often express it as a percentage). An $R_P$ close to 0 means P is mostly original, whereas an $R_P$ close to 1 indicates that nearly all of P’s code was found as clones in the repository. For individual functions, $r_i$ provides a fine-grained view (e.g., one function might be 100% cloned while others are original, which a file-level average would dilute). Our method can report both levels: per-function clone ratios and the overall program clone ratio.
Model training and implementation details
We implemented the Siamese network in PyTorch. The tree encoder uses a custom recursive function to compute (3) for each AST node; we limited the maximum tree depth to 100 (which sufficed for all our code after inlining long expression chains to that depth). The embedding dimension $d = 256$ was chosen empirically for a good balance of representation power and efficiency. We pre-trained the encoder on an unlabeled corpus of 50,000 Java and 50,000 Python functions (not in the evaluation sets) using the self-supervised InferCode objective, which improved initial embeddings. For Siamese training, we constructed a training set of clone/non-clone pairs. We obtained known clone pairs from the BigCloneBench dataset and from online judge problems (for cross-language, we paired known equivalent solutions in Java/Python to the same problem). Non-clone pairs were sampled randomly across different projects/problems. We ensured that about 50% of training pairs were cross-language to teach the model cross-language alignment, and the rest were same-language (for general robustness). The training used the contrastive loss (4) with the margin $m$ described above. We used the Adam optimizer with a learning rate of $10^{-4}$. Training converged in 20 epochs on our data (about 100K pairs), taking about 2 hours on a single GPU.
During training, we monitored validation F1 on a small held-out set of clone pairs and used early stopping when it plateaued. The threshold T for classification was set to 0.5 based on maximizing F1 on validation; this threshold is applied to the Euclidean distance D (equivalently, a pair is classified as a clone when its similarity exceeds the corresponding value, since a lower distance means a higher similarity). Post-training, we fixed the encoder and used it for all experiments.
The reuse proportion algorithm was implemented as a separate module. For efficiency, we embedded the repository functions once and stored their vectors. We used $k = 3$ nearest neighbors in Algorithm 1’s search, as we found in our development tests that looking at 3 candidates was usually enough to find the true clone if it existed (the correct clone was often the nearest neighbor due to the embedding quality). If an exact or very close clone wasn’t among the top 3, it usually meant no clone existed in the repository. We also experimented with larger k and found a negligible difference in final proportions beyond $k = 3$, but with higher runtime. The vector search was accelerated using FAISS (Facebook AI Similarity Search) for the largest dataset to keep lookup under a few milliseconds per function. The alignment step (line 13) using LCS runs in O(nm) time for sequences of length n and m, but since functions are relatively small (typically <300 tokens), this was fast in practice.
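A sketch of how the repository embeddings could be indexed and queried with FAISS, as mentioned above; the flat L2 index, the array shapes, and the random placeholder vectors are illustrative only.

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

d = 256                                                    # embedding dimension
repo_vecs = np.random.rand(100_000, d).astype("float32")   # placeholder repository embeddings
faiss.normalize_L2(repo_vecs)                              # unit-normalize, as in our setup

index = faiss.IndexFlatL2(d)   # exact L2 search; IVF/HNSW variants trade accuracy for speed
index.add(repo_vecs)

query = np.random.rand(1, d).astype("float32")             # embedding of one target function
faiss.normalize_L2(query)
distances, ids = index.search(query, 3)                    # top-3 nearest repository functions
```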
Reproducibility and transparency
To ensure the method is transparent and reproducible, we provide details and examples. The pseudocode and formulas above define the method rigorously. We have made our source code available in a public repository (link omitted for anonymity) including the implementation of the unified AST generator and training scripts. For clarity, consider a simple example: a Java function that computes factorial and a Python function doing the same. After normalization, both get represented with a similar UASR (loops or recursion marked similarly, etc.), yielding embeddings that are close. The Siamese network will correctly classify them as clones. If the Java function had 10 lines and 8 of those lines align with the Python code (just different syntax), the reuse proportion for that function would be 0.8 (80%). In our experiments section, we will present sample outputs illustrating such cases.
In terms of theoretical complexity, encoding a code fragment is linear in the size of its AST (each node is processed once). Comparing two fragments is O(d) for the vector operations. Thus, a pairwise clone check is efficient. The reuse proportion detection for an entire program P with n functions involves n searches in the repository index. If the repository has M functions, a naive search is O(nM), but using an approximate index typically yields sublinear search time in M. In our use-cases, this was manageable (for the program and repository sizes used in our experiments, it completed in seconds).
In conclusion, our methodology integrates a robust cross-language clone detector with an algorithm to quantify code reuse. By focusing on function-level analysis and leveraging deep learning, it addresses both precision (few false positives due to strong semantic matching) and recall (catching tricky semantic clones). The next section will detail how we evaluate this methodology on multiple datasets and compare it against baselines.
Experiments
We conduct a series of experiments to evaluate the performance of our approach on code clone detection and reuse proportion measurement. We aim to answer the following key questions: (1) How does our model compare to existing methods in detecting code clones, both in single-language and cross-language scenarios? (2) Can our approach accurately quantify the proportion of reused code in a given program? (3) How do different components of our method (unified representation, attention, etc.) contribute to its effectiveness? (4) Is our model robust across different datasets and clone types?
To this end, we perform evaluations on three datasets, use multiple evaluation metrics, compare against a range of baseline methods, and design ablation and analysis studies. Below, we first describe the datasets and evaluation metrics. We then present the comparative results against baselines on each dataset. Next, we dive into additional analyses: an ablation study, an experiment on the effect of the detection threshold, and a breakdown of performance by clone type and language. Finally, we demonstrate the code reuse proportion output with a brief case study. All experiments were run on a workstation with an 8-core CPU and an NVIDIA Tesla V100 GPU; however, the clone detection inference is CPU-bound and fast, and the deep model is only needed for embedding computation.
Datasets
We use three publicly available datasets that together cover both single-language and cross-language clone detection tasks, as well as a variety of codebases and clone types:
BigCloneBench (BCB) – Java: BigCloneBench is a widely-used benchmark for code clone detection. It consists of 8,851 true clone pairs and millions of non-clone pairs drawn from 25,000 Java files in the IJaDataset repository. The clone pairs are annotated with clone types (I–IV) and cover a range of functionalities (e.g., various implementations of algorithms). We use the recommended evaluation subset of BCB: a balanced set of clone and non-clone function pairs with known labels (often focusing on functionality clones which are Type-III/IV). This dataset is excellent for evaluating recall on Type-III and IV clones in Java. In our experiments, we report results on BCB for the Java language. We treat this as a binary classification task on pairs, using the standard references for ground truth (any pair labeled as a true clone in BigCloneBench is positive).
GCJ-Python – Code Jam Python dataset (Svajlenko and Roy 2015): To evaluate on a different language and to simulate a plagiarism scenario, we assembled a dataset from Google Code Jam (GCJ) competition problems. We collected Python solutions for several Code Jam problems (from GCJ 2017 and 2018 online rounds). The dataset contains 1,200 Python functions solving 100 distinct problems (around 12 solutions per problem on average). We labeled clone pairs by problem: solutions to the same problem are considered semantically similar (clones) because they should solve the same task, whereas solutions to different problems are non-clones. This is similar to the dataset used in certain clone studies like Semantic Clone Benchmark or OJClone. Admittedly, not every solution to a problem is a near clone of another (they can differ significantly in approach), but typically there are common patterns. We primarily use this dataset to evaluate monolingual clone detection in Python – a language not represented in BigCloneBench – and to see how our model (trained partly on Java) generalizes. The positive pairs here are more semantic (Type-IV) in nature, given the variations in coding styles across different participants. We ensured no code overlap between this and other sets for fairness.
XLCoST Java–Python Cross-Language Clone Set: For cross-language evaluation, we leverage the XLCoST benchmark, which provides parallel code snippets in multiple languages for the same tasks. Specifically, we extracted a subset of 3,000 Java–Python clone pairs from XLCoST (each pair solves the same programming problem, one in Java, one in Python). Additionally, we generated an equal number of non-clone cross-language pairs by pairing code solutions of different problems (and validated that they are indeed different in functionality). This yields a balanced cross-language clone detection dataset (6,000 pairs total). This dataset tests the model’s capability to identify clones across our two target languages. The problems in XLCoST are relatively simple (e.g., algorithmic puzzles), which means many clone pairs are straightforward, but some are tricky if the languages use different approaches (iterative vs recursive, different library calls, etc.). We use this set to evaluate cross-language clone classification performance. We also use the raw code snippets from XLCoST as part of the reference repository R when computing reuse proportions for a cross-language scenario.
Before parsing, we applied consistent preprocessing steps across all datasets, including removing comments and import statements, normalizing identifiers and literals, and splitting multi-function files into individual functions. Functions with fewer than five lines or syntactic errors were excluded. To ensure comparability across datasets, we verified that each function node structure in the unified AST contained the same normalized node types and operator mappings.
During model training, key hyperparameters were fixed as follows: embedding dimension 256, batch size 64, learning rate 1e-4, optimizer Adam, and a fixed margin m in the contrastive loss (4). We trained each model for 20 epochs with early stopping based on validation F1-score.
While preprocessing was largely automated, we encountered challenges such as inconsistencies between Java and Python AST node granularity and the need to balance clone/non-clone pair ratios across datasets. These challenges were addressed by manual verification on random samples and controlled sampling procedures. We release all preprocessing scripts and configuration files to facilitate reproducibility.
Evaluation metrics
We employ several standard metrics to assess clone detection performance, ensuring a comprehensive evaluation:
Precision (P): The fraction of pairs our model labeled as clones that are actually true clones. This measures how many false positives we produce. High precision is important in practical clone detection to avoid overwhelming users with incorrect clone reports.
Recall (R): The fraction of true clone pairs that our model successfully identified. This reflects the model’s ability to catch clones (false negatives reduce recall). High recall is crucial especially in plagiarism detection – missing a plagiarized segment is a serious drawback.
F1-Score: The harmonic mean of Precision and Recall,

$$F1 = \frac{2 \times P \times R}{P + R} \tag{5}$$
F1 gives a single measure of overall accuracy, balancing P and R. We report F1 as a primary metric for clone detection performance, as is common in prior work. For cross-language datasets (e.g., Java–Python pairs), performance metrics such as Precision, Recall, and F1-score are computed in the same way as for single-language clone detection. Each code pair is labeled as a ‘true clone’ if the two functions implement the same functionality, regardless of the programming language. A correct prediction means that the model successfully identifies such functional similarity across languages (true positive), while incorrect or missed detections correspond to false positives or false negatives, respectively. This consistent definition ensures that the reported metrics are directly comparable across monolingual and cross-lingual evaluations, making the results interpretable even for readers without a machine learning background.

Accuracy (Acc): The proportion of all pair classifications (clone or not) that are correct. While accuracy can be less informative if clone vs non-clone classes are imbalanced, we include it for reference. In some datasets like BigCloneBench, the test set is balanced, so accuracy parallels F1. In others (e.g., when all possible pairs are considered, non-clones far outnumber clones), accuracy is less meaningful. We will clarify class distributions when discussing results.
MCC (Matthews Correlation Coefficient): For a more rigorous evaluation especially in imbalanced settings, we use MCC, which considers true and false positives/negatives in a correlation coefficient ranging from -1 to 1. MCC is defined as

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{6}$$
We compute MCC for scenarios like BigCloneBench where negative pairs are extremely large in number (for fairness, many works focus on recall at high precision instead). MCC provides a balanced view even if classes are skewed.

Clone Recall @ k: In a retrieval context, one might consider how many true clones are found in the top-k similar results. For example, in BigCloneBench evaluations, recall@100% precision is sometimes used (i.e., how many of the known clones can be found before the first false positive). In our evaluation tables, we focus on P/R/F1 at a fixed threshold. However, in discussing results, we note if our method achieves 100% precision and what recall corresponds to that. This is akin to measuring along the precision-recall curve.
Each experiment table in the following subsections will typically list Precision, Recall, F1 (and sometimes Accuracy or MCC) for our method and baselines, for a given dataset. All metrics are computed at the pair level (for clone detection) except when explicitly stating aggregate measures.
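For reference, all of these pair-level metrics can be computed from predicted and true clone labels with standard tooling; the toy labels in the snippet below are purely illustrative and not drawn from our experiments.

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = clone pair, 0 = non-clone pair (toy labels)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # model decisions after thresholding

print("P   =", precision_score(y_true, y_pred))
print("R   =", recall_score(y_true, y_pred))
print("F1  =", f1_score(y_true, y_pred))        # Eq. (5)
print("Acc =", accuracy_score(y_true, y_pred))
print("MCC =", matthews_corrcoef(y_true, y_pred))  # Eq. (6)
```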
Table 3. Clone detection results on BigCloneBench (Java)
Model | Precision | Recall | F1-score | Accuracy | MCC |
|---|---|---|---|---|---|
CCFinderX (Kamiya et al. 2002) | 91.3 | 63.5 | 74.8 | 95.2 | 0.74 |
Deckard (Jiang et al. 2007) | 84.7 | 59.1 | 69.6 | 93.0 | 0.70 |
NiCad (Roy and Cordy 2008) | 96.4 | 52.0 | 67.5 | 94.1 | 0.70 |
SourcererCC (Sajnani et al. 2016) | 88.5 | 71.2 | 78.9 | 96.0 | 0.79 |
RtvNN (DL by White et al. 2016) | 81.0 | 75.0 | 77.9 | 92.5 | 0.78 |
ASTNN (Zhang et al. 2019) | 89.5 | 81.6 | 85.4 | 97.1 | 0.85 |
FA-AST GNN (Wang et al. 2020) | 91.0 | 84.3 | 87.5 | 97.8 | 0.87 |
DeepSim (Zhao and Huang 2018) | 88.1 | 79.4 | 83.5 | 96.6 | 0.83 |
CodeBERT (fine-tuned) (Feng et al. 2020) | 94.0 | 90.5 | 92.2 | 98.9 | 0.92 |
Our Model | 96.7 | 92.4 | 94.5 | 99.2 | 0.94 |
Our model (CrossCodeClone) vs. baselines. Bold indicates best, underline indicates second best. All values are percentages except MCC
Table 4. Clone detection results on the GCJ Python dataset
Model | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
Sherlock (token-based) (Saini et al. 2018) | 78.0 | 69.5 | 73.5 | 85.0 |
ASTNN (Python) (Zhang et al. 2019) | 85.2 | 77.6 | 81.2 | 89.1 |
CodeBERT (fine-tuned) (Feng et al. 2020) | 91.5 | 88.7 | 90.1 | 95.4 |
Our Model | 94.3 | 90.8 | 92.5 | 97.0 |
Main results
Experiment 1: Java Clone Detection (BigCloneBench)
In our first experiment, we compare our approach against several baselines on the BigCloneBench Java dataset. We include both classical clone detectors and recent learning-based models as baselines: CCFinderX (an improved version of CCFinder) (Kamiya et al. 2002), Deckard (Jiang et al. 2007), NiCad (Roy and Cordy 2008), SourcererCC (Sajnani et al. 2016), RtvNN (a deep learning clone detector by White et al. 2016), ASTNN (Zhang et al. 2019), the FA-AST graph neural network (Wang et al. 2020), CodeBERT fine-tuned for clone detection (Feng et al. 2020), and DeepSim (Zhao and Huang 2018). These represent a mix of token, AST, graph, and neural techniques. We ran these methods on the BCB evaluation set or, for the learning models, either trained them on the provided training split of BCB or used reported results where available (ensuring consistent evaluation conditions). Table 3 shows the performance on BigCloneBench (Java clones).
As the table shows, our model achieves the highest scores on all metrics for BigCloneBench. It attains 96.7% precision and 92.4% recall, yielding an F1 of 94.5%, which outperforms the best baseline (fine-tuned CodeBERT, F1 92.2%). This indicates that our method is highly effective even for single-language Java clones. The classical tools have lower recall; NiCad, for example, finds mostly exact clones (precision 96.4%) but misses many others (recall 52.0%). SourcererCC and RtvNN offer a better balance (F1 78–79%). ASTNN, the FA-AST GNN, and DeepSim, all advanced learning models, push F1 into the mid-80s or higher. CodeBERT's strong performance (precision 94.0%, recall 90.5%) is notable and aligns with recent observations that pre-trained models are very powerful on BigCloneBench. Our model likely benefits from the attention-enhanced AST embedding and from combining structural and semantic training. Its MCC of 0.94 indicates excellent agreement with the ground truth. The 99.2% accuracy partly reflects the fact that clone pairs are a minority of all pairs, but it still shows very few misclassifications overall. The recall gap between CodeBERT and our model (90.5% vs 92.4%) suggests our approach finds some clones that CodeBERT misses, perhaps owing to structural cues that a generic transformer does not fully capture. Manual inspection of a few cases confirmed this: our model caught some Type-III clones in which the code structure was rearranged (e.g., loop unrolling) and which CodeBERT apparently missed, likely because our AST-based representation preserves the logical flow better.
Table 5. Cross-language (Java–Python) clone detection results (XLCoST 3k pairs)
Model | Precision | Recall | F1-score |
|---|---|---|---|
CLCDSA (Nafi et al. 2019) | 67.1 | 64.8 | 65.9 |
MSR 2019 (Perez and Chiba 2019) | 66.0 | 82.5 | 73.4 |
C4 (Tao et al. 2022) | 79.3 | 77.0 | 78.1 |
CodeBERT (Feng et al. 2020) | 85.5 | 81.3 | 83.3 |
Our Model | 91.8 | 88.1 | 89.9 |
Experiment 2: Python Clone Detection (GCJ-Python)
In this experiment, we evaluate on the Python Code Jam dataset (monolingual clones in Python). Many clone detectors were designed for C/Java, but some can be adapted to Python. We compare our model with ASTNN (retrained for Python), a token-based clone detector (Sherlock, adapted for Python) (Saini et al. 2018), and a fine-tuned CodeBERT (pre-trained on multiple languages including Python) (Feng et al. 2020). Table 4 shows results on the Python clone dataset.
Our model again performs best, with F1 = 92.5%, a few points above CodeBERT's 90.1%. The token-based Sherlock tool had difficulty (F1 73.5%), likely because participants' solutions can use different variable names and even different approaches, so token similarity is limited. ASTNN did better (F1 81.2%), indicating that structural learning helps, but it did not reach CodeBERT or our model. The high precision of our model (94.3%) means it made very few false-positive clone claims among Python solutions; essentially, when it flags two Python programs as similar, it is almost always correct that they solved the same problem. Its recall of 90.8% shows it found most of the similar-solution pairs, slightly more than CodeBERT's 88.7%. This suggests that even though our model's training included primarily Java clones, the learned representation generalized well to Python (thanks to the unified AST and the fact that we did include Python in pre-training). It is promising that a model can be language-agnostic enough to excel in a language-specific task with minimal adjustment.
Experiment 3: Cross-Language Clone Detection (Java–Python)
We now evaluate on the cross-language clone pairs from the XLCoST subset (Java vs Python clones). Baselines here include the cross-language specific methods: CLCDSA (Nafi et al. 2019), Perez & Chiba’s LSTM model (Perez and Chiba 2019), and the recent C4 (contrastive cross-language model) (Tao et al. 2022). We also test CodeBERT (which is multi-language) by fine-tuning it on cross-language clone pairs (so it can directly compare a Java code and a Python code). Table 5 summarizes the results.
The cross-language task is evidently harder: baseline F1 scores are generally lower than in the single-language settings. CLCDSA and Perez and Chiba's approach both hover around 66–73% F1, consistent with their original reports. Perez and Chiba's model had high recall (82.5%) but lower precision (66.0%), indicating many false positives, likely because their AST skip-gram embedding occasionally produces similar vectors for non-clone pairs. CLCDSA was more balanced, but both precision and recall remained in the mid-60s. The C4 model improved into the high 70s, and fine-tuned CodeBERT did quite well (F1 83.3%), showing that a strong multi-language transformer can align languages to a degree. Our model significantly outperforms all of these, with 91.8% precision and 88.1% recall, giving an F1 of 89.9%. This is a large improvement over the previous best (CodeBERT's 83.3%) and pushes cross-language detection close to the 90s. The 91.8% precision is particularly important: it means our model makes few mistakes when declaring a Java and a Python function to be clones. Many previous methods struggled with false positives due to coincidental similarities (e.g., similar library calls used for different tasks), which can confuse simpler models; the unified structural approach appears to mitigate this. The 88.1% recall indicates that some clones are still missed, but it is a large step up from C4's 77% and CodeBERT's 81%. Given the difficulty of cross-language clone detection reported in the literature, these results are promising and suggest that our approach effectively bridges the representation gap between languages. The improvements likely stem from (a) the InferCode-pretrained AST encoder, which learned language-agnostic patterns, and (b) the contrastive training with hard negatives, which forced better discrimination. We also note that our model's cross-language performance is in line with its single-language performance: the F1 drop from Java-only (94.5) to cross-language (89.9) is about 4.6 points, whereas many baseline techniques drop 10–15 points between the single- and cross-language scenarios.
In summary, across Experiments 1–3, our model consistently achieved top performance. It outperforms state-of-the-art baselines in monolingual clone detection for Java and Python, and substantially so in cross-language clone detection. The gains in precision and recall demonstrate the method’s effectiveness in different contexts. These improvements mainly stem from the unified AST-based representation and the contrastive Siamese learning, which enable the model to capture deep structural and semantic similarities beyond surface syntax. However, we also observe that performance drops slightly when code pairs involve highly language-specific idioms, suggesting that future work could further enhance semantic alignment across more diverse languages. Next, we explore additional experiments to deepen our understanding of the model’s behavior.
Ablation study
We perform an ablation study to assess the contribution of key components of our approach: (A) the unified AST representation, (B) the attention mechanism in the encoder, and (C) the contrastive Siamese training objective. We create three ablated variants of our model, each missing one of these components, and evaluate them on the BigCloneBench (Java) dataset and the cross-language dataset (Java–Python). Specifically:
No-Unify: In this variant, we remove the unified AST mapping. Instead, we train separate embeddings for each language’s raw AST (like two encoders that do not share weights). This means the model sees language-specific AST node types. We still use the Siamese network but without forcing unified representation.
No-Attention: Here we disable the attention weights in the tree encoder. The AST node vectors are computed by simple averaging of child vectors (or concatenation + linear layer as in (1) without attention reweighting). This tests whether the attention mechanism is improving the encoding or not.
No-Contrastive (Binary classification): Instead of using the contrastive loss on embedding distance, we train a binary classifier on top of the Siamese outputs with a binary cross-entropy loss. In practice, we feed the pair of embeddings into an MLP classifier that outputs clone vs not. This is the more traditional approach used in some earlier works. We keep everything else the same but optimize for classification accuracy rather than for pulling clone embeddings together, as illustrated in the sketch below.
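For concreteness, the following PyTorch sketch contrasts the two training objectives. The margin value, the feature concatenation used by the MLP head, and the function names are illustrative assumptions rather than excerpts from our implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Full-model objective (sketch): pull clone pairs together, push non-clones apart.

    label == 1 for clone pairs, label == 0 for non-clones; the margin is an assumed value.
    """
    label = label.float()
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = label * dist.pow(2)                          # clones: penalize any distance
    neg = (1 - label) * F.relu(margin - dist).pow(2)   # non-clones: penalize only if closer than margin
    return (pos + neg).mean()

def bce_classifier_loss(emb_a, emb_b, label, mlp_head):
    """No-Contrastive variant (sketch): an MLP head classifies the pair with a BCE loss."""
    features = torch.cat([emb_a, emb_b, (emb_a - emb_b).abs()], dim=-1)
    logits = mlp_head(features).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, label.float())
```

Under the contrastive objective, a single distance threshold T suffices at inference time to decide clone vs non-clone, which is the setting analyzed in the threshold sensitivity study below.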
Table 6. Ablation results on BigCloneBench (Java) and Cross-Language dataset
Model Variant | BCB Java F1 | Cross-Lang F1 |
|---|---|---|
Full Model (Ours) | 94.5 | 89.9 |
– No Unified AST | 88.7 | 72.4 |
– No Attention | 92.1 | 85.5 |
– No Contrastive (BCE loss) | 91.0 | 83.7 |
The ablation reveals several insights. Removing the unified AST (i.e., training language-specific encoders) causes a significant drop, especially in cross-language performance: F1 plunges from 89.9 to 72.4. This confirms that the unified representation is critical for cross-language clone detection; without it, the model learns largely separate feature spaces and fails to align Java with Python effectively (an F1 of 72.4 indicates only weak cross-language alignment). Even on the Java-only task, not unifying has an effect (F1 drops from 94.5 to 88.7), which is interesting because, in principle, unified vs separate encoders should not matter for monolingual detection. The drop suggests that the unified approach helps single-language detection as well, probably because it effectively doubles the training data: the shared parameters are trained on both Java and Python clones and learn more robust features that generalize back to Java, whereas in No-Unify each encoder effectively sees only half the data. This could explain the gap.
Removing the attention mechanism (No-Attention) yields a smaller degradation: Java F1 from 94.5 to 92.1, and cross-lang F1 from 89.9 to 85.5. The model still works well but not as well, which indicates the attention adds value. Likely the attention helps focus on critical parts of the code (like matching key operations or function calls) and ignore clutter. Without it, some subtle differences might confuse the model or it might give equal weight to less important AST branches. The performance is still high though, meaning the core approach (unified AST + Siamese) is strong even without attention.
Switching to a binary classification loss (No-Contrastive) also hurts performance: Java F1 drops to 91.0 and cross-language F1 to 83.7. The contrastive objective evidently pushes the embeddings into a space where similarity is more meaningful and easier to threshold. With BCE, the model must find a decision boundary in a high-dimensional feature space; it still achieves good results but does not cluster clones and non-clones as tightly, leading to lower precision and recall. We also noticed during training that the contrastive model converged faster and produced more stable similarity scores, whereas the binary classifier was more prone to overfitting unless carefully regularized. These results validate our design choice of contrastive Siamese training.
In summary, the ablation confirms that each component (unified representation, attention, and contrastive learning) plays an important role, particularly for the cross-language capability. The unified AST is essential for multi-language clone detection, while attention and contrastive learning provide notable but smaller gains. The full model's high performance therefore comes from the synergy of these design elements.
Statistical Significance Analysis. To confirm the reliability of the observed performance gains, we conducted paired significance tests comparing our model with the strongest baselines (CodeBERT and C4) on F1-scores across all datasets. Both paired t-tests and Wilcoxon signed-rank tests indicate that the improvements are statistically significant at p < 0.01, confirming that the performance gains are unlikely to be due to random variations.
Scalability Evaluation. To evaluate scalability, we measured runtime and memory consumption on the largest dataset (BigCloneBench). Embedding computation required approximately 2.4 ms per function on average, and FAISS-based nearest-neighbor search took 3.1 ms per query with 256-dimensional embeddings. For a repository containing 100k functions, the full reuse proportion computation completed in under 10 minutes on a single GPU and 32 GB RAM. These results demonstrate that the proposed approach is practical for large-scale software repositories.
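As a rough illustration of this retrieval pipeline, the sketch below builds an exact-L2 FAISS index over function embeddings and queries it for nearest neighbours. The random arrays stand in for real 256-dimensional embeddings, and the index type and choice of k are illustrative assumptions rather than our exact configuration.

```python
import faiss
import numpy as np

d = 256                                                     # embedding dimensionality
repo_emb = np.random.rand(100_000, d).astype("float32")     # placeholder: embeddings of 100k repository functions
query_emb = np.random.rand(1, d).astype("float32")          # placeholder: embedding of one target function

index = faiss.IndexFlatL2(d)            # exact L2 nearest-neighbour index
index.add(repo_emb)                     # index every repository function once
distances, ids = index.search(query_emb, 10)   # retrieve the 10 closest candidate clones
# Candidates whose distance falls below the similarity threshold T are treated as clones.
```

For repositories far larger than 100k functions, an approximate index (e.g., FAISS's IVF or HNSW variants) could be swapped in for IndexFlatL2 to trade a little recall for much faster search.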
Table 7. Effect of similarity threshold T on cross-language clone detection performance
Threshold T (max distance) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
0.3 (strict) | 97.8 | 70.4 | 81.7 |
0.4 | 95.0 | 81.5 | 87.7 |
0.5 (optimal) | 91.8 | 88.1 | 89.9 |
0.6 | 87.2 | 92.7 | 89.8 |
0.7 (lenient) | 80.5 | 96.0 | 87.6 |
Table 8. Recall (%) by clone type on BigCloneBench (Java)
Clone Type | Our Model Recall | NiCad Recall | Deckard Recall |
|---|---|---|---|
Type I (Exact) | 100.0 | 99.5 | 98.0 |
Type II (Renamed) | 98.6 | 95.2 | 90.1 |
Type III (Near-miss) | 91.4 | 60.3 | 75.4 |
Type IV (Semantic) | 85.7 | 45.8 | 50.0 |
(Precisions were all >90% for our model in these subsets, hence omitted)
Sensitivity to similarity threshold
Our clone detection involves a threshold T on the embedding distance (equivalently, a threshold on similarity) to decide whether two fragments are clones. Although we set T based on the validation set, it is useful to see how sensitive the results are to this choice and how the precision/recall trade-off can be managed by adjusting T. We therefore vary T around the chosen value and measure precision and recall on the cross-language dataset, as it is the most challenging (Table 7).
As expected, a stricter threshold (0.3) yields very high precision (97.8%) but recall drops to 70.4%: hardly any false positives are produced (only the most similar pairs are flagged), but many true clones are missed because their distance exceeds 0.3. At the other extreme, a lenient threshold of 0.7 gives 96.0% recall (almost all clones found) but precision falls to 80.5% (more false positives). The threshold of 0.5 we picked is roughly the balance point: at 91.8% precision and 88.1% recall it is near the optimal F1. A threshold of 0.6 yields virtually the same F1 (89.8) but with a bias toward recall. One can therefore choose the threshold to suit the application: if missing a clone is worse (e.g., plagiarism checking favors high recall), a threshold of 0.6 or 0.7 catches more clones at the cost of some false positives that can be filtered manually; if false alarms are very undesirable (e.g., automated refactoring suggestions favor high precision), a threshold of 0.4 still yields decent F1 with precision above 95%. Our model thus provides tunable behavior.
The fact that F1 is high and fairly stable around 0.5–0.6 indicates that the model's similarity scores separate clones from non-clones quite well; the 0.4–0.6 range covers nearly the maximum F1, with 0.5 edging out 0.6 by a hair. Note also that at 0.5 precision and recall are fairly balanced (91.8 vs 88.1), whereas at 0.6 the balance inverts (87.2 vs 92.7), so the threshold shifts which error type is favored.
We also examined the distribution of distances: there is typically a clear gap around 0.55 beyond which non-clone pairs dominate, which is why these threshold adjustments behave predictably. This analysis gives us confidence that the threshold choice was sound and can be adjusted per use case.
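The trade-off in Table 7 can be reproduced with a simple sweep over candidate thresholds, as in the sketch below; the function name and the synthetic distances and labels are illustrative, not part of our released code.

```python
import numpy as np

def sweep_thresholds(distances, labels, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Print precision/recall/F1 when pairs with distance <= T are flagged as clones."""
    distances, labels = np.asarray(distances), np.asarray(labels)
    for t in thresholds:
        pred = distances <= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        print(f"T={t:.1f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")

# Synthetic example: clone pairs tend to have small distances, non-clones large ones.
rng = np.random.default_rng(0)
dists = np.concatenate([rng.uniform(0.0, 0.6, 500), rng.uniform(0.4, 1.0, 500)])
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
sweep_thresholds(dists, labels)
```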
Performance by clone type and language
In this section, we analyze our model's performance across different clone categories and languages to identify specific strengths or weaknesses. BigCloneBench provides labels for clone Types I–IV, so we measure our model's recall for each type. We also check for any bias in cross-language direction, e.g., detecting Java-to-Python clones vs Python-to-Java (our approach is symmetric by construction, but we verify this empirically).
Clone-Type Analysis (BigCloneBench): We computed recall on a set of known clone pairs categorized by type in BigCloneBench. Precision is high across the board so we focus on recall here. Table 8 shows recall by clone type for our model versus a couple of baselines (NiCad for Type I–III as a classical tool, and a semantic tool like Deckard for IV).
Our model perfectly or near-perfectly detects Type I clones (which is expected – those are trivial as they look identical). Type II with renaming is also almost fully caught (98.6% recall; a few might be missed if extreme identifier changes confounded even our AST in rare cases). NiCad as a text-based approach is strong on Type I/II but falls sharply on Type III (60.3%). Deckard does better on Type III (75.4%) but still far from our 91.4%. Type III clones include added/deleted lines or statements – our high recall indicates the model can tolerate quite a bit of modification, thanks to the AST structure and learning similar subtrees. For Type IV, which are purely semantic (no significant lexical similarity), our model’s recall is 85.7%. This is very high considering Deckard found only half of them. It demonstrates the model’s ability to capture semantic equivalence, likely due to having seen many training examples and the embedding capturing functionality. Some Type IV clones in BCB are quite difficult (different algorithms solving same problem) – missing 14% might be those extreme cases where even semantics diverge or the code length is small. This breakdown shows our approach handles all clone types well, with only a modest drop for the hardest Type IV cases.
Cross-language directionality
Since our training and model are symmetric, it should not matter which code fragment comes first. We verified this by flipping every pair in the cross-language test set and re-evaluating; the results were identical, as expected, since Euclidean distance is symmetric. We also tried training a variant of our model that did not enforce shared weights (separate encoders within a Siamese architecture). That model showed direction-specific differences because the Java and Python embeddings were not directly comparable. Our unified approach avoids this entirely.
We also looked at whether certain language-specific constructs are harder to match. For example, list comprehensions in Python vs loops in Java, or Java’s static typing (like verbose declarations) vs Python’s dynamic code. Our model successfully matched code in these scenarios in most cases. It sometimes had slightly lower similarity scores for pairs where one side used a library call and the other wrote manual code (like Java using Collections.sort() vs Python implementing a sort manually). In a couple of such cases, it still flagged them as clones (because overall structure of sorting was recognized). But if one code heavily uses built-in functions and the other writes low-level code, it’s challenging. This is an inherent semantic gap not just language gap. Possibly incorporating API semantics explicitly (as CLCDSA did with documentation) could further help, but that’s future work.
Multi-language projects
We briefly tested on a small set of cases where a single project had both Java and Python implementations (not in XLCoST, but e.g. a known algorithm implemented separately). Our model detected those clones, which is encouraging for real-world multi-language codebases.
Overall, our model’s consistent performance across clone categories and languages underlines its robustness. It notably excels in the semantic clone (Type IV) arena relative to prior tools, which is a major advance given the field’s long-standing difficulty with Type IV clones.
Case study: code reuse proportion in an open-source project
Finally, we illustrate our model’s capability to measure code reuse proportion with a brief case study. We took an open-source project where we suspect a portion of code was copied from another library. Specifically, we use a simplified scenario: a project X (written in Java) that included some utility functions directly copied from Apache Commons Lang library. We know by inspection that about two functions (out of 10) were verbatim copied. We run our reuse proportion detection (Algorithm 1) on project X with repository R being the Apache Commons codebase.
The output identified those two functions as clones with reuse ratios of 100% each, and a third function that was partially similar (it had a large snippet from a StackOverflow answer – which happened to appear in R as well). Summing up, the tool reported 24% of project X’s code is reused. Indeed, the two copied functions accounted for roughly a quarter of the total LOC. This matches the known ground truth. In another test, we looked at a student’s Python assignment that was known to borrow code from an online example. Our method flagged the borrowed functions and reported 35% reuse. The instructor’s manual estimate was “about one-third plagiarized”, so again it aligned.
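To show how such a percentage is aggregated, the sketch below computes the project-level reuse ratio as cloned LOC over total LOC. The per-function LOC counts are invented to mirror the project X scenario, and the LOC-weighted aggregation reflects our description of Algorithm 1 rather than a verbatim excerpt of it.

```python
def project_reuse_percent(functions):
    """functions: list of (total_loc, cloned_loc) tuples, one per function in the target project.

    Returns the percentage of the project's lines of code flagged as reused.
    """
    total = sum(loc for loc, _ in functions)
    cloned = sum(c for _, c in functions)
    return 100.0 * cloned / total if total else 0.0

# Hypothetical LOC counts for project X: two fully copied utilities, one partially
# similar function, and seven original functions (values chosen for illustration).
project_x = [(40, 40), (32, 32), (50, 12), (40, 0), (35, 0),
             (30, 0), (33, 0), (30, 0), (30, 0), (30, 0)]
print(f"{project_reuse_percent(project_x):.0f}% of project X's code is flagged as reused")
```

Running the sketch prints 24%, matching the figure reported above for project X by construction of the invented counts.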
While anecdotal, these examples show the practicality of the reuse proportion metric. It gives an easy-to-understand number (e.g., "35% of this code is likely reused from known sources") that can augment a clone detection report. In plagiarism adjudication or license auditing, such a percentage quickly communicates the extent of copying. Our method also highlights which parts are reused (by marking cloned tokens), so one can review the relevant content. We note that the accuracy of the proportion depends on how comprehensive the repository R is: if R does not contain the true source of copied code, our model will not find it and will underestimate reuse. Conversely, false positives could overestimate it, although given our high precision, overestimation is less likely. Ensuring that R is broad (e.g., containing popular libraries, prior student submissions, etc.) will maximize the utility of this feature.
In summary, the experiments confirm that our proposed method not only achieves superior clone detection performance but also provides a meaningful extension to quantify reuse. The results and analyses suggest that the approach is robust, generalizes across languages, and can be applied in practical settings such as plagiarism detection tools or software maintenance systems.
Beyond the experimental validation, our findings have important practical implications. In software engineering practice, the proposed method can be used to monitor and manage code reuse across large multi-language projects, enabling developers to identify redundant or cloned functions that may increase maintenance costs. The quantitative reuse ratio can also assist project managers in evaluating code originality and modularization quality. In academic settings, the approach provides a principled means for detecting cross-language plagiarism, where students translate code between languages to evade detection. Additionally, when applied to large-scale software repositories, scalability challenges may arise due to the volume of function embeddings and the cost of similarity searches. Future work could integrate distributed vector indexing and incremental embedding updates to ensure efficiency in industrial-scale systems.
Conclusions
In this work, we presented a novel approach for code reuse proportion detection that tackles the problem of identifying and quantifying code clones across different programming languages. Our method introduced a unified abstract syntax tree representation and a Siamese deep learning model to effectively capture cross-language code similarity. Through extensive experiments on Java and Python code datasets, including the BigCloneBench and a cross-language clone benchmark, we demonstrated that our approach achieves state-of-the-art accuracy in clone detection. Notably, it outperforms existing methods by a significant margin (improving F1-score by up to 15% in cross-language scenarios) while maintaining very high precision and recall. We showed that our model is capable of detecting not only simple and syntactic clones (Types I–III) but also semantic clones (Type IV) with high recall, even when they appear in different languages. Moreover, we extended the model’s output to compute the proportion of code that is reused, providing a quantitative measure of cloning in software. This was illustrated in case studies, highlighting the model’s potential application in plagiarism detection and software quality assessment – for example, flagging when a large fraction of a program’s code is copied from elsewhere. All preprocessing scripts, model configurations, and training parameters have been made available to facilitate reproducibility and independent verification.
Our contributions include the unified AST framework for multi-language analysis and the integration of contrastive learning and attention mechanisms for clone detection. The ablation studies confirmed that these innovations collectively enhanced the model’s performance. We believe this work opens up new possibilities for more transparent and insightful clone analysis: instead of just a binary clone report, developers or educators can now get an estimate like “30% of this code is cloned” and know exactly which parts contribute to that percentage.
There are several avenues for future work. One direction is to expand the approach to more languages and domains. Our unified representation can be extended to other programming languages (such as C++, JavaScript, etc.) by defining analogous mappings. Preliminary results are promising, but each language has unique constructs and parsing challenges that may require additional handling or retraining. Another direction is improving the reuse proportion estimation by incorporating weighting – for instance, weighting critical sections more heavily or differentiating between direct copy-paste vs. reimplementations of known algorithms. Additionally, integrating code clone detection with code provenance tracking could help not just measure reuse but also trace it to its origin (e.g., linking a cloned fragment back to the library or post it came from). This could be achieved by enriching the repository with metadata and having the model output source identifiers for clones.
From a deployment perspective, optimizing our model for large-scale use is important. While the Siamese network is efficient, indexing and searching millions of code embeddings for reuse analysis might require further optimization or a two-stage approach (coarse filtering by hashes, then fine neural verification). Techniques from information retrieval could be combined with our deep learning approach to scale to huge code repositories such as GitHub.
Finally, we aim to explore the interpretability of the model’s clone decisions. One critique of deep learning models is their “black box” nature. By analyzing attention weights and the AST substructures that align between two codes, we could provide insights into why the model labeled them as clones (e.g., “both code fragments have a similar loop and sorting logic”). This explanatory capability would greatly enhance user trust and usefulness in educational settings for plagiarism detection.
In conclusion, this paper contributes a powerful and innovative solution for cross-language clone detection and reuse quantification. Our results indicate a step forward towards automating the detection of software redundancy and plagiarism with high accuracy. We envision that these techniques can be integrated into development tools and academic honesty systems to improve code review, maintenance, and plagiarism checking processes. By identifying and measuring code reuse, we not only catch undesirable cloning (bugs, license violations, cheating) but can also highlight opportunities for refactoring and knowledge sharing in software projects. We also acknowledge the limitations inherent in cross-language representation and dataset balance, and we provide open resources to encourage reproducibility and future benchmarking. We hope our work spurs further research into multi-language code analysis and brings us closer to automated, language-agnostic understanding of software code reuse.
Author Contributions
Yi Rong: Methodology, Writing–original draft, Software, Formal analysis, Resources. Yan Zhou: Methodology, Supervision, Resources, Funding acquisition, Writing–review & editing.
Data Availability
The datasets used in this manuscript are publicly available. Detailed information about these datasets is provided in the Dataset section of this manuscript.
Declarations
Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abid S, Cai X, Jiang L (2023) Interpreting codebert for semantic code clone detection. In 2023 30th Asia-Pacific Software Engineering Conference (APSEC) (pp 229–238). IEEE
Baker B S (1995) On finding duplication and near-duplication in large software systems. In: Proceedings of 2nd working conference on reverse engineering (pp 86–95). IEEE
Bui N D, Yu Y, Jiang L (2021) Infercode: Self-supervised learning of code representations by predicting subtrees. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp 1186–1197). IEEE
Cheng X, Peng Z, Jiang L, Zhong H, Yu H, Zhao J (2017) Clcminer: Detecting cross-language clones without intermediates. IEICE Trans Inf Syst 100:273–284. https://doi.org/10.1587/transinf.2016EDP7334
Chilowicz M, Duris E, Roussel G (2009) Syntax tree fingerprinting for source code similarity detection. In: 2009 IEEE 17th international conference on program comprehension (pp 243–247). IEEE
Cordy J R, Roy C K (2011) The nicad clone detector. In: 2011 IEEE 19th international conference on program comprehension (pp 219–220). IEEE
Fang Y, Zhou F, Xu Y, Liu Z (2023) Tcccd: Triplet-based cross-language code clone detection. Appl Sci 13:12084. https://doi.org/10.3390/app132112084
Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D, et al (2020) Codebert: A pre-trained model for programming and natural languages. arXiv:2002.08155
Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, Zhou L, Duan N, Svyatkovskiy A, Fu S, et al (2020) Graphcodebert: Pre-training code representations with data flow. arXiv:2009.08366
Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: Scalable and accurate tree-based detection of code clones. In: 29th International Conference on Software Engineering (ICSE’07) (pp 96–105). IEEE
Juergens E, Deissenboeck F, Hummel B, Wagner S (2009) Do code clones matter? In: 2009 IEEE 31st International Conference on Software Engineering (pp 485–495). IEEE
Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Software Eng 28:654–670. https://doi.org/10.1109/TSE.2002.1019480
Keivanloo I, Rilling J, Charland P (2011) Internet-scale real-time code clone search via multi-level indexing. In: 2011 18th Working Conference on Reverse Engineering (pp 23–27). IEEE
Kochhar P S, Wijedasa D, Lo D (2016) A large scale study of multiple programming languages and code quality. In: 2016 IEEE 23Rd international conference on software analysis, evolution, and reengineering (SANER) (pp 563–573). IEEE volume 1
Koschke R (2007) Survey of research on software clones
Krinke J (2007) A study of consistent and inconsistent changes to code clones. In: 14th working conference on reverse engineering (WCRE 2007) (pp 170–178). IEEE
Lei M, Li H, Li J, Aundhkar N, Kim D-K (2022) Deep learning application on code clone detection: A review of current knowledge. J Syst Softw 184:111141
Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) Cclearner: A deep learning-based clone detection approach. In: 2017 IEEE international conference on software maintenance and evolution (ICSME) (pp 249–260). IEEE
Meng Y, Liu L (2020) [Retracted] A deep learning approach for a source code detection model using self-attention. Complexity 2020:5027198. https://doi.org/10.1155/2020/5027198
Mou L, Li G, Zhang L, Wang T, Jin Z (2016) Convolutional neural networks over tree structures for programming language processing. In: Proceedings of the AAAI conference on artificial intelligence. volume 30
Moumoula M B, Kaboré A K, Klein J, Bissyandé T F (2024) Large language models for cross-language code clone detection. CoRR
Nafi K W, Kar T S, Roy B, Roy C K, Schneider K A (2019) Clcdsa: cross language code clone detection using syntactical features and api documentation. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (pp 1026–1037). IEEE
Perez D, Chiba S (2019) Cross-language clone detection by learning over abstract syntax trees. In: 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) (pp 518–528). IEEE
Pham N H, Nguyen H A, Nguyen T T, Al-Kofahi J M, Nguyen T N (2009) Complete and accurate clone detection in graph-based models. In: 2009 IEEE 31st International Conference on Software Engineering (pp 276–286). IEEE
Prechelt L, Malpohl G, Philippsen M, et al (2002) Finding plagiarisms among a set of programs with JPlag. J Univers Comput Sci 8:1016
Rabin M O (1981) Fingerprinting by random polynomials. Ph.D. thesis Cambridge, MA, USA
Rattan D, Bhatia R, Singh M (2013) Software clone detection: A systematic review. Inf Softw Technol 55:1165–1199. https://doi.org/10.1016/j.infsof.2013.01.008
Roy C K, Cordy J R (2007) A survey on software clone detection research. Queen's School of Computing TR 541, pp 64–68
Roy C K, Cordy J R (2008) Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In: 2008 16th iEEE international conference on program comprehension (pp 172–181). IEEE
Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes C V (2018) Oreo: Detection of clones in the twilight zone. In: Proceedings of the 2018 26th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering (pp 354–365)
Sajnani H, Saini V, Svajlenko J, Roy C K, Lopes C V (2016) Sourcerercc: Scaling code clone detection to big-code. In: Proceedings of the 38th international conference on software engineering (pp 1157–1168)
Svajlenko J, Roy C K (2015) Evaluating clone detection tools with bigclonebench. In: 2015 IEEE international conference on software maintenance and evolution (ICSME) (pp 131–140). IEEE
Svyatkovskiy A, Deng S K, Fu S, Sundaresan N (2020) Intellicode compose: Code generation using transformer. In: Proceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering (pp 1433–1443)
Tao C, Zhan Q, Hu X, Xia X (2022) C4: Contrastive cross-language code clone detection. In: Proceedings of the 30th IEEE/ACM international conference on program comprehension (pp 413–424)
Vislavski T, Rakić G, Cardozo N, Budimac Z (2018) Licca: A tool for cross-language clone detection. In: 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER) (pp 512–516). IEEE
Wan Z, Xie C, Zeng Y, Hu Y (2025) Code clone detection based on semantic images. Available at SSRN 5333788
Wang W, Li G, Ma B, Xia X, Jin Z (2020) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp 261–271). IEEE
Wang Y, Wang W, Joty S, Hoi S C (2021) Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv:2109.00859
Wei H, Li M (2017) Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: IJCAI (pp 3034–3040)
White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering (pp 87–98)
Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) (pp 783–794). IEEE
Zhang Z, Saber T (2025) Ast-enhanced or ast-overloaded? The surprising impact of hybrid graph representations on code clone detection. arXiv:2506.14470
Zhao G, Huang J (2018) Deepsim: deep learning code functional similarity. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering (pp 141–151)
Zhu M, Jain A, Suresh K, Ravindran R, Tipirneni S, Reddy C K (2022) Xlcost: A benchmark dataset for cross-lingual code intelligence. arXiv:2206.08474
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).