Abstract

Code reuse through cloning is common in software development, yet excessive or unchecked cloning can harm maintainability and raise plagiarism concerns. Detecting the proportion of reused (cloned) code in a software project, especially across different programming languages, is a challenging task. This paper defines code reuse proportion detection as measuring how much code in a target program is cloned (identical or similar) from elsewhere. Existing code clone detection techniques perform well in single-language settings but struggle with cross-language clones and do not directly quantify reuse proportion. To address these gaps, we propose a novel cross-language function-level code clone detection approach using a dual embedding Siamese neural network. Our method represents code in Java and Python using a unified abstract syntax structure and semantic embeddings, then uses a Siamese deep network to learn language-agnostic similarities. We also introduce a metric to quantify the clone-based reuse ratio for each function or program. Experiments on three public datasets (including a Java clone benchmark, a Python code clone corpus, and a cross-language Java–Python clone dataset) show that our approach outperforms ten baseline methods, including state-of-the-art and classical clone detectors. Ablation studies confirm the contribution of each component (structural embeddings, cross-language alignment, and contrastive learning) to performance gains. Our model achieves new state-of-the-art accuracy in code clone detection, enabling precise measurement of code reuse. These results demonstrate that the proposed approach can effectively detect cross-language code clones and quantify reuse proportion, benefiting software plagiarism detection and code quality assessment in multi-language projects.
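The abstract does not state the formula behind the reuse metric, so the following is only a minimal sketch of one plausible reading: the share of a target program's lines that belong to functions whose embedding is sufficiently similar to some reference function. The names `cosine` and `reuse_ratio`, the line-count weighting, and the 0.9 threshold are all assumptions made for illustration, and plain cosine similarity stands in for the learned Siamese similarity described by the authors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reuse_ratio(target, corpus, threshold=0.9):
    """Hypothetical reuse metric, not the paper's definition.

    target:    list of (embedding, line_count) pairs, one per target function
    corpus:    list of embeddings of reference functions (any language)
    threshold: assumed similarity cutoff for calling a function a clone
    """
    total_lines = sum(lines for _, lines in target)
    if total_lines == 0:
        return 0.0
    # A function counts as reused if any reference function is similar enough.
    cloned_lines = sum(
        lines
        for emb, lines in target
        if any(cosine(emb, ref) >= threshold for ref in corpus)
    )
    return cloned_lines / total_lines


# Usage: two target functions (10 and 30 lines); only the first matches the corpus.
target_functions = [([1.0, 0.0], 10), ([0.0, 1.0], 30)]
reference_embeddings = [[1.0, 0.0]]
ratio = reuse_ratio(target_functions, reference_embeddings)
```

Weighting by line count is one design choice among several; counting functions equally, or weighting by token count, would give a different but equally defensible ratio.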

Details

1009240
Title
Quantifying cross-language code reuse via function-level clone detection
Author
Rong, Yi 1; Zhou, Yan 2

1 The University of New South Wales, School of Education, New South Wales, Australia (GRID:grid.1005.4) (ISNI:0000 0004 4902 0432)
2 South China Agricultural University, College of Mathematics and Informatics, Guangdong, China (GRID:grid.20561.30) (ISNI:0000 0000 9546 5767)
Volume
37
Issue
10
Pages
327
Publication year
2025
Publication date
Dec 2025
Publisher
Springer Nature B.V.
Place of publication
Amsterdam
Country of publication
Netherlands
Publication subject
e-ISSN
13191578
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-11-20
Milestone dates
2025-10-25 (Registration); 2025-09-12 (Received); 2025-10-24 (Accepted)
First posting date
20 Nov 2025
ProQuest document ID
3274025682
Document URL
https://www.proquest.com/scholarly-journals/quantifying-cross-language-code-reuse-via/docview/3274025682/se-2?accountid=208611
Copyright
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-22
Database
ProQuest One Academic