Introduction
The ease with which knowledge can be shared through global interactive communication platforms has prompted writers to conduct targeted online searches for information [1]. This convenience has had a negative side effect: individuals have attempted to take credit for others’ work by copying ideas or research without due attribution [2], a problem noted especially in the scholarly community. Plagiarism detection is one of the most significant issues at the intersection of text mining, academic literature standards, and natural language processing (NLP) [3]. Many problems remain unsolved concerning borderline cases and standards. Of the numerous approaches to plagiarism detection, some are rather simple, while others require the application of complex algorithms and scientific ideas [4]. The prevalence of plagiarism in academic settings has increased, with cases discovered in many student works, such as papers, assignments, and projects. Academic plagiarism is defined as using ideas, terminology, or structures without properly citing the source [5]. While students approach plagiarism in different ways, the most extreme cases involve the complete rewriting of original material [6]. Additional strategies include rewording material via online paraphrasing services, replacing terms with synonyms, and partially rephrasing text through changes to grammatical structures [7]. Academic plagiarism is one of the most serious forms of misconduct, since it compromises the acquisition and evaluation of competencies in violation of ethical standards [8].
Plagiarism Detection (PD) is one of the most important problems in text analysis. Its goal is to locate instances of illegitimate duplication or content infringement within a single document [9]. This form of detection is predicated on analyzing a single document independently, without the need to compare it to additional sources or documents [10,11]. Although plagiarism by researchers in papers and by students in projects is not a new problem, it has become more troublesome due to the ease with which information can be “copied and pasted” from journals and other online sources [12]. Although reusing other people’s content can be intentional, it is often an unintentional error. Publishers, instructors, examiners, and others can identify plagiarism more quickly and readily with the software programs already on the market [13], and they no longer need to rely solely on their own ability to spot parallels with previously released content [14]. Plagiarism detection software can assess student and authored papers in a matter of minutes by comparing them to previously published works.
To ensure text authenticity across a range of applications, AI classifier tools have recently come to be relied upon for differentiating between content generated by AI and writing done by humans. For example, OpenAI, the company that created ChatGPT, unveiled an AI text classifier that helps users distinguish between essays written by humans and those written by AI [15]. Based on the likelihood that a document is artificial intelligence (AI) generated, this classifier divides texts into five categories, from very unlikely to likely AI-generated [16]. Although the OpenAI classifier has been trained on a wide variety of texts, not all forms of human-written text are included in the training data. Furthermore, tests conducted by the developers reveal that the classifier correctly identifies only 26% of AI-written text as “likely AI-generated”, while incorrectly labelling 9% of human-written content as AI-generated [17]. Therefore, rather than depending solely on the classifier’s results to determine whether material is AI-generated, OpenAI recommends that users treat the results as supplementary information. Writer.com’s AI content detector is one of several AI text classifier tools available; it highlights the usefulness of AI-generated content for content marketing and provides a restricted application programming interface (API)-based solution for recognizing it. Copyleaks, an AI content detection system that reports a 99% accuracy rate, integrates with numerous learning management systems and APIs. Emi and Spero [18] developed GPTZero, an AI text classifier designed to identify AI-generated content in student submissions and prevent AI plagiarism in educational settings.
Selecting the most trustworthy and efficient plagiarism detection system can be challenging these days due to the abundance of options [19]. Consequently, this paper presents the results of a survey-based study on selecting an efficient academic plagiarism detection method. According to a quantitative research study covering a variety of plagiarism types across a broad range of plagiarised scripts [20], plagiarism checkers succeed in identifying extrinsic plagiarism, while intrinsic plagiarism detection depends on stylometric features examined through the arrangement of the papers. A thorough analysis of the classification of plagiarism-checking methods was conducted, concentrating on textual characteristics, organizational characteristics, semantic structures, candidate-information extraction prototypes, and plagiarism-finding procedures. Concept plagiarism appears far down the hierarchy of clever plagiarism types [21] because it lacks the textual semantics required to transmit the concept and localizes perspective in the format. The primary objective of this survey is to comprehensively review and evaluate the state of the art in plagiarism detection techniques [22,23]. This survey also aims to clarify the difficulties that are unique to low-resource languages and to suggest possible lines of inquiry for further research in plagiarism detection, which will ultimately improve detection techniques and lessen the likelihood of illegitimate content replication [24].
Literature review
Literature reviews have become a vital tool for thoroughly investigating and understanding a range of topics. Numerous perceptive review studies have been carried out in a variety of domains [25], shedding light on important issues [26]. These reviews frequently fall into different categories, such as meta-reviews or mapping studies [27], narrative or conventional reviews [28], and systematic literature reviews [29]. This study provides a review of the literature that explores plagiarism detection and offers a fresh viewpoint and analysis of the subject. A noteworthy advancement in this field occurred in mid-2019 with the release of the GROVER model [30], which can both create and identify fake news. GROVER, which surpassed previous deep pre-trained models, boasted a 92% accuracy rate in identification and had access to 5,000 of its own generated articles in addition to unlimited genuine news items. The Giant Language Model Test Room (GLTR) tool [31] was made available in June 2019. GLTR is an open-source tool that finds generation artifacts from the different sampling methods used in language models by examining texts produced by models such as Generative Pre-trained Transformer-2 (GPT-2) with several baseline statistical methods. Later that year, OpenAI released a customized GPT-2 detector [32] that fine-tuned the RoBERTa model [33].
In [12], Altheneyan and Menai conducted a critical analysis of the methods currently used for paraphrase detection and automated plagiarism detection. Their study explained the categories of paraphrase phenomena, the fundamental approaches, and the sets of features that each method made use of, and it evaluated how well plagiarism-detection methods that recognize paraphrases perform on benchmark corpora. The main discovery was that feature subcategories such as word overlap and structural representations help support vector machines (SVMs) deliver the best performance in paraphrase identification and plagiarism detection on corpora. According to a study [34] on their effectiveness, deep learning techniques are the most interesting area of research in this discipline. A unique model for creating and identifying fraudulent internet reviews was created in 2020 [35]. This novel strategy combined the review-generation capabilities of GPT-2 with a fine-tuned BERT model as the classifier for the detection phase. In the same year, Uchendu et al. [36] investigated the difficulty of differentiating human-written texts from those produced by neural network-based language models. Their focus was on three authorship attribution problems: determining whether two texts were generated by the same algorithm, determining whether a piece was written by a machine or by a human, and identifying the exact neural algorithm that generated a given text. For their empirical studies, they used texts authored by people as well as writings produced by eight different models, including the Conditional Transformer Language Model (CTRL) [37], the Cross-lingual Language Model (XLM) [38], Generalized Autoregressive Pretraining for Language Understanding (XLNet) [39], and the Plug and Play Language Model (PPLM) [40].
The study found that while the majority of text generators continue to produce content that can be distinguished from Human-Written Text (HWT), models such as GPT-2, GROVER, and FAIR deliver higher-quality outputs that confuse machine classifiers more often.
As part of the “TweepFake” project [41,42], a detector was created by Fagni et al. and Jawahar et al. in 2021 to identify deepfake tweets. The first authentic dataset of deepfake tweets on Twitter was created as part of this project. The research team collected and analyzed tweets from 23 bots that imitated 17 real user accounts using a range of content-generating techniques, including GPT-2, RNN, LSTM, and others. In 2022, a different study [43] revealed a neural-network-based detector that combines textual data with explicit factual information. This is made possible via entity-relation graphs, which capture the interactions between entities and relationships in the text and are encoded by a graph convolutional neural network. By reasoning about the facts presented, the model aims to distinguish modified news items better than detectors that rely only on stylometric signals. In a study [44], Guo et al. compared the capabilities of ChatGPT with those of human specialists; it was one of the first studies released in 2023 to comprise datasets generated by ChatGPT and HWT, with detectors trained on the same dataset in both Chinese and English. The Human ChatGPT Comparison Corpus (HC3) is the name given to the vast dataset that the researchers gathered. With the help of resources such as ELI5 [45] and ChatGPT, which generated replies for these questions, it contains about 40,000 questions and answers covering a wide range of topics, including psychology, economics, law, and medicine.
The authors of [46] presented a revolutionary watermarking framework for Large Language Models (LLMs) in 2023. Their method embeds a watermark using the output log-likelihood of LLMs at each generation step, mainly by using a green token list. This technique identifies and governs the outputs of these powerful language models ethically by incorporating detectable signals, unseen by human readers, into the generated text. Building on the paradigm outlined in [46], Guo et al. proposed three significant enhancements to existing watermarking approaches through further investigation of the topic [47]. Another important work on watermarking, carried out by Christ et al. [48], takes a unique approach by utilizing cryptographic concepts. This novel approach, distinct in that it depends on cryptography, guarantees that watermarks incorporated into LLMs stay undetected unless a particular secret key is used. Details of existing plagiarism detection models are given in Table 1. Moreover, the present survey is unique in that it investigates difficulties in great detail, covering both general and low-resource language-specific challenges. Among various studies, [49] has noted challenges; however, none have particularly examined the intricacies of tackling PD issues in languages with limited linguistic resources. In response to the heightened demand for upholding rigorous scholarly ethics in higher education, there is a pressing need to ensure the efficacy of plagiarism detection techniques. The study [50] examines how higher education institutions might enhance their plagiarism detection capabilities through the utilization of Artificial Intelligence (AI).
[Figure omitted. See PDF.]
The authors of a study [51] reported that a 2023 survey examined 3,017 high school and college students and found that nearly one-third admitted to using ChatGPT for homework help. The emergence of Large Language Models (LLMs) like ChatGPT and Gemini has increased academic dishonesty: students can now complete tasks and examinations merely by requesting solutions from a language model, circumventing the effort necessary for learning. This is all the more concerning because instructors lack appropriate instruments for detection. Plagiarism detection systems are employed primarily in the education sector; however, they are also useful in other fields such as journalism and media, business, and the creative arts. Turnitin, a prevalent plagiarism detection software, offers a mechanism to identify textual similarities in articles and scientific papers and to mitigate plagiarism. Nonetheless, it is crucial to tailor it to the environment and particular requirements of Islamic Religious Education programs at private Islamic universities. The proposed method [52] entails integrating Turnitin with Islamic literary databases and modifying detection algorithms to recognize religious texts. Plagiarism has recently become a prominent topic of discussion in higher education institutions. Turnitin text-matching software has been used extensively by numerous academic institutions in Ghana to enhance the academic writing of students and professors and to identify instances of plagiarism. Despite extensive research on attitudes, motivations, and demographic factors associated with academic dishonesty, there has been limited empirical investigation into students’ actual knowledge of plagiarism and their experiences with text-matching technology, as indicated by a study [53]. Moreover, this survey is unique in that it includes a future-directions component.
This section provides insightful information for future research and development by imagining possible paths and developments for plagiarism detection.
Survey methodology
To stay true to the main goal of this study, which is to analyze the literature on plagiarism detection, we have compiled knowledge and recommendations from previous approaches reported in numerous studies [25–30]. Using this expertise, we have formulated appropriate research questions, search methodologies, and clear study objectives. With this method, we efficiently search for and locate pertinent papers in the area of plagiarism detection. The study’s literature review component focused on currently accessible peer-reviewed journal papers on AI detection techniques that differentiate between writings created by AI and texts authored by humans across a variety of disciplines. The lack of published, peer-reviewed journal publications in its core area limited its quality assessment, and its searches were thorough but constrained by a time frame.
Ethics statement
Ethical approval was not sought for the present study because all the data used are publicly available.
Research objectives
The goal of this study was to review articles that were published between 2019 and 2024. It concentrated on how well Artificial Intelligence detection technologies work in identifying text produced by humans and AI. The study’s main focus was on the AI detection technologies used at this time in higher education. Since ChatGPT’s debut and the subsequent spread of AI-powered chatbots, determining which AI detection technologies work better and whether their detection accuracy is dependable have been some of the major issues facing the higher education industry. The research goals of this study are defined as follows:
1. To list and evaluate the most widely used feature extraction methods for plagiarism detection.
2. To examine and contrast the most popular techniques for detecting plagiarism.
3. To demonstrate how plagiarism detection methods have changed over time.
4. To recognize and investigate the difficulties in identifying plagiarism.
Research questions
We have prepared a series of relevant research questions, each aimed at delving into distinct facets of plagiarism detection, to tackle the research objectives properly. The following research questions were considered in the study to achieve the goals of the survey:
1. What are the key feature extraction techniques most commonly used?
2. Which methods are most commonly used to detect plagiarism?
3. What does each article aim to achieve?
4. How have methods for detecting plagiarism changed over time?
5. What challenges are there, and how may they be overcome?
6. What are the survey’s primary conclusions and findings?
Research strategy
We used Google Scholar and Web of Science to conduct a keyword-based automated search [60] to gather the research publications that were part of our survey. The search was restricted to the years 2019 through 2024; foundational publications were included regardless of when they were published, to make sure our survey covers all pertinent source material. Online databases, academic social networking sites, and search engines were all used in the search. These online search platforms comprised several online databases (Google Scholar, PLOS, Taylor & Francis Online, ACM, ScienceDirect, Scopus, and IEEE Xplore Digital Library), two Internet search engines (Google and Microsoft Bing), and ResearchGate, all of which were readily available. Search strings included keywords, phrases, and brief sentences related to the study’s target area: AI detection tools for distinguishing between texts produced by AI and those written by humans. Depending on the search platform, the search strings included truncation symbols such as * or -, and Boolean operators such as AND or OR. Variations of these search strings were applied repeatedly. The following search queries were used to locate pertinent publications for this investigation.
* “plagiarism” AND “detection”
* “plagiarism” AND “detection” AND “Trends”
* “analysis” AND “stylometric features”
* “Semantic” AND “stylometric features”
* “feature extraction approaches”
* “feature extraction methods”
* “style analysis”
* “grammar analysis”
* “syntax based detection”
* “challenges in plagiarism detection”
The process of selection
The research selection process is an important step in the literature survey process [61]. For this survey, 189 preliminary studies on plagiarism detection were gathered from various sources. The authors used preset inclusion and exclusion criteria to shortlist the papers during the selection process. Another author was consulted to settle any disagreements, and the inclusion/exclusion standards were refined. The traditional inclusion/exclusion format served as the foundation for the quality assessment criteria used to determine the eligibility and relevance of journal articles for this investigation. Journal articles published between 2019 and 2024 met the time-period inclusion criterion. During the designated coverage period, a search and screening procedure was carried out on the internet search platforms described above to identify journal articles that qualified for inclusion. The web search engines returned 189 articles as a result of this procedure. Of these articles, forty did not satisfy the specified coverage period and were removed, as were 34 duplicates. Titles and abstracts were examined to narrow down the remaining papers.
1. Search by title: Papers that are unrelated based solely on their title were carefully culled in the first step. Many irrelevant papers were present; only 115 papers remained after this step.
2. Search by methodology keywords: In this phase, papers unrelated to the survey based on their methodology-section keywords were rigorously culled, leaving only the pertinent papers.
3. Abstract-based search: In this step, the abstracts of the articles selected in the previous step were evaluated, and the papers were organized by analysis and research approach. Just ninety-nine papers remained after this.
4. Complete text-based evaluation: At this point, the articles selected in the previous step were evaluated for their empirical quality; the full text of each study was thoroughly analyzed. A total of forty-five papers were chosen from the ninety-nine articles. Four more relevant and qualifying articles were found through the snowballing search that followed. Thus, 49 publications in all were appropriate and qualified for the current investigation.
5. Low-quality papers: The last step in the selection process excluded papers not indexed in the Google Scholar database. Additionally, publications without a DOI (Digital Object Identifier) were not included in the analysis.
Quality assessment criteria
The quality of the selected primary studies was evaluated using the following standards, with the quality evaluation carried out by two authors. Each article was rated on sixteen criteria, each scored as yes (1) or no (0); one such criterion is whether the study reports main empirical findings. Another is whether the study was published in a reputable venue, as determined by the Scientific Journal Ranking (SJR) and the CORE ranking of conferences. For review studies, assessing and guaranteeing methodological quality is crucial, even when the number of review studies in a particular field is limited. The authors independently assessed each of the reviewed articles. The agreement scores between the two raters were calculated using Cohen’s kappa coefficient (κ), as referenced in [62]. Disagreements on ratings were settled through discussion and consensus-building [63]. The scoring system developed by Landis et al. [64] and its associated interpretation were used to compute inter-rater agreement, which measures the extent to which independent raters arrive at the same conclusion when scoring items. The mutual agreement score here was considered acceptable because it falls between 0.84 and 1.00, the almost-perfect range [65,66].
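As a concrete illustration of this agreement measure, the following is a minimal sketch of Cohen’s kappa for two raters scoring binary (yes = 1 / no = 0) criteria. The rating lists below are invented for illustration only; they are not the survey’s actual data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (e.g., yes=1 / no=0 criteria)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical raters scoring sixteen criteria for one article.
a = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
b = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
print(round(cohens_kappa(a, b), 2))
```

With one disagreement out of sixteen criteria, the resulting kappa falls in the “almost perfect” band of the Landis interpretation scale.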
Results of study selection
To provide answers to the aforementioned research questions, a total of 49 papers were found and examined. Thirteen papers had a score of less than 15%, while 31 papers achieved a score above 85%. Some lower-scoring papers were also included in this analysis because they offer helpful information and were published in reputable journals. These studies also address significant technological and demographic factors directly relevant to plagiarism.
Scrutiny of survey articles
The primary vulnerabilities associated with systematic literature reviews are inadequacies in the data collection process and in the choice, organization, and display of content. To reduce the possibility of missing important information, we mainly employed Google Scholar and Web of Science, two of the largest databases of academic literature. We queried the two databases using a multi-stage approach, in which the results of each step informed the next, with keywords we gradually refined to achieve the widest coverage. By following all relevant references of the papers that our keyword-based search had produced, we gathered further documents, drawing on the knowledge of domain experts, research paper authors, and literature reviewers on the topic. We also incorporated content-based suggestions from major publishers’ digital libraries, including Elsevier and ACM.
Datasets
This survey thoroughly examines the standard datasets, listed in Table 2, used to evaluate and analyze the field of plagiarism detection. These datasets are essential in plagiarism detection because they make it easier to compare and assess different detection methods; they provide diverse textual material from many genres and sectors. Researchers can benefit from the PAN datasets (https://pan.webis.de/data.html) [67], which cover several years and include plagiarism detection scenarios. The Corpus of English Novels (CEN) (https://github.com/computationalstylistics/100_english_novels) [70] is another noteworthy dataset that provides a special collection for assessing plagiarism detection methods. These datasets are complemented by other sources, such as Wikipedia and a variety of online papers, to improve the accuracy and resilience of plagiarism detection techniques. Their availability enables researchers to create, evaluate, and improve algorithms that can address plagiarism in real-world situations involving a variety of textual sources and styles. Finding copied passages in a suspicious work based on an inconsistent writing style is the purpose of the intrinsic plagiarism detection method.
[Figure omitted. See PDF.]
The intrinsic technique, in contrast to the external approach, does not need to compare the suspicious material to any possible sources of plagiarism. Of the 1,024 texts in the InAra corpus [75], 80% contain passages that have been plagiarised and disguised to appear authentic. An XML file linked to every suspicious document details the length and location of every passage that contains plagiarism. The artificial healthcare dataset [68] was created to give data science, machine learning, and data analysis enthusiasts a helpful resource. The dataset given in [77] is notable for its wide range of texts, which include a variety of genres common in Urdu writing. It features yearly occasions, prominent citizens of the country, and textual moral teachings to guarantee a representative assortment of Urdu-language content. This composition enables a thorough investigation of the complexities involved in plagiarism detection for various document forms. With the use of numerous stylometric parameters, the dataset painstakingly portrays the complex terrain of Urdu writing styles.
Pre-processing
Careful preprocessing procedures are needed for intrinsic plagiarism detection techniques to eliminate noise and preserve crucial data for analysis [78]. To avoid losing possibly helpful information, it is advised to keep preprocessing to a minimum. Preprocessing techniques such as part-of-speech tagging, lemmatization, stop-word removal, sentence segmentation, paragraph construction, tokenization, lowercasing, and removal of punctuation are applied during the preprocessing step [19,79]. The objective of these procedures is to extract pertinent linguistic elements for plagiarism analysis and standardize the content [80–82]. To provide consistency in the text representation, for example, lowercasing is used to transform all characters to lowercase [83,84].
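The normalization steps listed above can be sketched as a tiny pipeline. The stop-word list below is an illustrative subset, not the full list a real system (e.g., one built on NLTK or spaCy) would use.

```python
import re

# Illustrative stop-word subset; real pipelines use much fuller lists.
STOP_WORDS = {"the", "of", "are", "as", "is", "a", "an", "to", "and", "known"}

def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation, tokenize, and optionally drop stop words."""
    text = text.lower()                    # normalize case
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    tokens = text.split()                  # whitespace tokenization
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("The regions of Earth surrounding the Equator are known as the tropics."))
# ['regions', 'earth', 'surrounding', 'equator', 'tropics']
```

Keeping each step a separate, optional operation makes it easy to follow the advice above and apply only the minimum preprocessing a given detection method needs.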
Because stop-word and punctuation removal strongly affect word co-occurrences, and consequently the corresponding relationships, they can significantly alter the resulting word embeddings [85]. For instance, take the phrase “The regions of Earth surrounding the Equator are known as the tropics.” from the English Wikipedia and consider the word “tropics” and its neighbours in a window of size five: the words “Equator are known as the” fall within that window. After stop-word removal, the sentence becomes “regions Earth surrounding Equator known tropics”, so the window around “tropics” now contains “regions Earth surrounding Equator known”; content words that previously fell outside the window now co-occur with it. Although removing punctuation has less impact in this example than removing stop words, the words within the window are still not the same, showing that punctuation too can alter a word’s context. A corpus contains a great deal of punctuation, so removing any of it can significantly alter word contexts [84,86].
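The effect described above can be made concrete with a small helper that extracts the co-occurrence window around a target word before and after stop-word removal; the stop-word set here is again an illustrative subset.

```python
def context_window(tokens, target, size):
    """Return the tokens within `size` positions of the first occurrence of `target`."""
    i = tokens.index(target)
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

sentence = "the regions of earth surrounding the equator are known as the tropics".split()
stop_words = {"the", "of", "are", "as", "known"}
filtered = [t for t in sentence if t not in stop_words]

# Window of five around "tropics" before and after stop-word removal:
print(context_window(sentence, "tropics", 5))   # ['equator', 'are', 'known', 'as', 'the']
print(context_window(filtered, "tropics", 5))   # ['regions', 'earth', 'surrounding', 'equator']
```

After filtering, distant content words such as “regions” enter the window while function words leave it, which is exactly why embeddings trained on filtered text differ.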
Identifying the boundaries of sentences or other linguistic units within the text is the main goal of segmentation [87–89]. To lessen the influence of particular numerical information on plagiarism analysis, numbers may be removed or replaced with placeholders. Furthermore, named entity recognition (NER) can provide important information by identifying and classifying named entities in the text [90,91]. Removing stop words is the process of eliminating common words with little semantic significance, such as “the,” “is,” and so on [92].
After the text has been purged of punctuation and numerals, separated into tokens, and stripped of stop words, the remaining words can be transformed. The next stage is eliminating affixes, that is, components joined to the root that result in the creation of a new word. Stemming and lemmatization [88] are two methods that accomplish this task; however, they differ greatly in speed and transformation technique. Stemming removes prefixes and suffixes from a word so that the remaining portion stays the same across all inflected forms of the word. Two issues arise here: under-stemming, in which two terms with similar meanings are reduced to two distinct forms, and over-stemming, in which two instances with different meanings are merged into one form [93]. The text dictionary can be made smaller by using one of the several stemming algorithms developed to reduce words to their roots, such as Lovins, Porter, Paice/Husk, Dawson, HMM, and YASS [94]. By assigning each word a grammatical category (noun, verb, adjective, etc.), PoS tagging facilitates syntactic analysis and helps spot possible plagiarism by pointing out similarities in word use [34,95,96]. Natural Language Processing (NLP) libraries provide reliable and practical tools for these preparatory procedures [97–99]. Researchers mostly employ these libraries in their multilingual, multi-functional text processing pipelines for plagiarism detection investigations [100,101].
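To illustrate under- and over-stemming, here is a deliberately crude suffix-stripping stemmer; it is a hypothetical toy, not the Porter algorithm or any other published stemmer.

```python
def naive_stem(word):
    """A deliberately crude suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ing", "ed", "er", "s"):
        # Strip the first matching suffix, keeping a root of at least 3 letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Over-stemming: words with different meanings merge into one form.
print(naive_stem("news"), naive_stem("new"))      # new new
# Under-stemming: inflections of the same word end up with different stems.
print(naive_stem("running"), naive_stem("runs"))  # runn run
```

Real stemmers such as Porter add ordered rule sets and measure conditions precisely to reduce both failure modes, but neither can be eliminated entirely.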
Methods for extracting features
In natural language processing (NLP), the extraction of distinctive characteristics is crucial in facilitating efficient representation and analysis of textual input. Sentiment analysis, machine translation, text summarization, and other applications benefit from its ability to capture linguistic features, semantic information, and structural patterns [102]. Deeper linguistic analysis is made possible by extracting significant characteristics, which increases natural language processing application accuracy and productivity. The process of selecting and converting unprocessed data into a set of relevant features that accurately represent and explain the data is known as feature extraction [103]. Feature extraction is a subfield of natural language processing (NLP) that specializes in obtaining relevant and significant characteristics from textual input. In this procedure, words are taken out of the text data and transformed into features that classifiers may use [104]. Feature extraction reduces the volume of data by identifying the most valuable characteristics by merging variables into components. Plagiarism detection relies heavily on feature extraction. The text is transformed into numerical representations with relevant data using a variety of techniques and methods [105].
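As one concrete example of converting raw text into numerical features, the following sketches a plain bag-of-words TF-IDF computation. This is a generic illustration of the idea, not the specific feature set used by any of the surveyed systems.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Turn tokenized documents into sparse TF-IDF dictionaries."""
    n = len(documents)
    df = Counter()                 # document frequency of each term
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        # term frequency (normalized) times inverse document frequency
        vec = {term: (count / len(doc)) * math.log(n / df[term])
               for term, count in tf.items()}
        vectors.append(vec)
    return vectors

docs = [
    "plagiarism detection compares documents".split(),
    "plagiarism detection uses stylometric features".split(),
]
vecs = tfidf_vectors(docs)
```

Terms occurring in every document (here “plagiarism”, “detection”) receive zero weight, while distinctive terms keep positive weight, which is what makes such vectors useful for similarity comparison.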
Lexical
Finding linguistic distinctions between a language’s dialects frequently calls for in-depth human investigation and specialized knowledge, largely because of the intricacy and subtlety involved in learning different dialects [106]. When calculating similarity, lexical identification techniques only take into account the characters present in a given text. To identify obfuscated plagiarism, lexical detection techniques need to be used in conjunction with more advanced NLP techniques [107,108]. Lexical detection techniques are also useful for detecting homoglyph substitutions, a prevalent technical means of disguising plagiarism. Approaches to lexical detection generally fall into the categories described in the following sections.
N-grams.
Character-level N-grams, which are N-character sequences, provide a deeper examination of the text’s structure [109,110]. On the other hand, word-level N-grams provide information about the syntactic and semantic patterns of the document by encapsulating N-word sequences. N-grams are typically represented numerically for quantitative calculations and comparisons. After representation, the N-grams are compared and contrasted to search for trends and variations [111,112]. The phrase “The cow jumps over the moon” is one example. With N=2 (known as bigrams), the N-grams would be:
* the cow;
* cow jumps;
* jumps over;
* over the;
* the moon.
Character n-gram comparisons can be utilized for cross-language plagiarism detection (CLPD) in cases where the languages involved, such as Spanish and English, have a high degree of lexical similarity [113].
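The n-gram extraction itself is straightforward; the sketch below (plain Python, with helper names of our own choosing) produces the word bigrams for the example sentence above, plus character bigrams for comparison:

```python
def word_ngrams(text, n):
    """Sliding window of n consecutive (lowercased) words."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Sliding window of n consecutive characters."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

bigrams = word_ngrams("The cow jumps over the moon", 2)
print(bigrams)
# [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]
print(char_ngrams("cow", 2))  # ['co', 'ow']
```

Overlap between two documents’ n-gram sets (e.g. via the Jaccard coefficient) then serves as a lexical similarity signal.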
Querying search engines.
Web search engines are used by many detection techniques for candidate retrieval, the first step of the detection process, which is the identification of possible source documents. The success of this approach depends on the mechanism used to choose the query terms from the suspicious document [114], for example, finding the longest sentence in a paragraph and its keywords.
Vector space model (VSM).
The Vector Space Model (VSM) has been widely used by researchers as a foundational method in plagiarism detection [10,11]. The VSM technique converts textual data into numerical vectors to represent texts and evaluate their similarity, facilitating quantitative analysis and making it possible to find potential plagiarism cases. The fundamental concept of VSM is that every document is represented as a vector in a multi-dimensional space, with each dimension denoting a distinct word or term [115]; the value of each dimension indicates the significance or frequency of that term within the document. In plagiarism detection, word N-grams usually define the vector space’s dimensions, and each vector’s components are weighted according to the Term Frequency - Inverse Document Frequency (TF-IDF) scheme [116]. Inverse Document Frequency (IDF) values are derived from the suspicious document or the corpus. The cosine measure is frequently utilized to quantify the similarity between vector representations; in other words, the angle between the vectors acts as a proxy for the similarity between the documents they represent. Here, lexical and grammatical traits are extracted and classified using tokens instead of raw strings. The similarity can be determined using a variety of vector similarity metrics, such as the Manhattan, Euclidean, Dice, Overlap, Cosine, and Jaccard coefficients [117]. The Cosine and Jaccard coefficients in particular can be used to ascertain the degree of similarity between two vectors, and the cosine coefficient can identify partial copying without revealing the content of the document; this approach therefore aids plagiarism detection in situations where work submission is confidential.
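As a minimal sketch of this pipeline (the smoothed IDF formula and the toy documents are illustrative choices of ours, not taken from the cited works), the following pure-Python example builds TF-IDF vectors over a small corpus and compares them with the cosine measure:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors over the corpus vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    # Smoothed IDF so terms present in every document keep a nonzero weight.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cow jumps over the moon",
        "the cow leaps over the moon",
        "stars shine at night"]
_, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # near-duplicate pair: high similarity
print(cosine(vecs[0], vecs[2]))  # unrelated pair: no shared terms, 0.0
```

In a real system the vectors would be built over word n-grams rather than single words, and a similarity threshold would flag suspicious pairs for closer alignment.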
Stylometric features.
In the field of intrinsic plagiarism detection, researchers have frequently used stylometric-based feature extraction algorithms to examine the subtle differences in writing styles found in textual data [118]. To create a stylometric profile specific to every document, these characteristics are carefully measured [119]. Calculating statistical measurements such as word frequencies, average sentence lengths, or punctuation mark distributions is a common approach in stylometric feature extraction [120]. These metrics provide information on the writing style and preferences of the author.
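As an illustrative sketch (the particular statistics and the sample text are our own choices, not taken from the cited works), such a stylometric profile can be computed with the Python standard library:

```python
import re
from collections import Counter

def stylometric_profile(text):
    """Per-document style statistics: average sentence length (in words),
    average word length, and punctuation mark distribution."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    punct = Counter(ch for ch in text if ch in ".,;:!?")
    return {
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "punctuation": dict(punct),
    }

profile = stylometric_profile(
    "Short sentence. A much longer, rambling sentence follows here!"
)
print(profile)
```

Profiles computed per paragraph or sliding window can then be compared against the whole-document profile; outlier segments are candidates for a change of author.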
Semantic
Semantics-based approaches work under the premise that the presence of similar semantic units in two passages determines their semantic similarity [77]. Two units are semantically comparable when they occur in similar contexts [120]; the concept of semantic similarity stems from the fact that units with similar contexts tend to have higher semantic similarity. Many techniques make use of thesauri like WordNet or EuroVoc to take advantage of semantics in the analysis. The performance of paraphrase identification is improved by these thesauri’s useful semantic properties, which include synonyms, hyponyms (subordinate terms), and hypernyms (superordinate terms) [111].
Latent semantic analysis (LSA).
Latent Semantic Analysis (LSA) measures the similarity of term distributions in texts using a matrix with rows for words and columns for documents, whose components typically reflect log-weighted Term Frequency - Inverse Document Frequency (TF-IDF) values [121,122]. The term-document matrix is then approximated at a lower rank by LSA using dimensionality reduction techniques like Singular Value Decomposition (SVD); this retains fewer rows while preserving the similarity structure of the columns. The terms that survive the dimensionality reduction are considered the most representative of the semantic content of the text. Consequently, by comparing the texts’ rank-reduced matrix representations, one can determine a semantic similarity that conventional vector space models are unable to capture [123]. LSA’s capacity to handle synonymy is advantageous for paraphrase recognition.
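The rank reduction at the heart of LSA can be summarized as follows (a sketch in standard notation, assuming A is the m × n term-document matrix and e_i the i-th standard basis vector):

```latex
% Truncated SVD: keep only the k largest singular values.
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(m, n)
% Documents are then compared in the reduced space, e.g. with the cosine measure:
\operatorname{sim}(d_i, d_j)
  = \cos\!\left( \Sigma_k V_k^{\top} e_i,\; \Sigma_k V_k^{\top} e_j \right)
```

Because synonymous terms load onto the same latent dimensions, two paraphrased passages can score as similar even with little literal word overlap.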
Word embeddings.
One popular approach to intrinsic plagiarism detection is word embedding-based feature extraction, which aims to capture the intricate contextual and semantic meanings of words in a document. This method places related words closer together by encoding words as dense vectors inside a continuous vector space [124]. This proximity-based representation makes it possible to extract important features that reflect the semantic linkages between words. The first step in the procedure is to train a word embedding model on a large text corpus; by examining the contexts in which words are used, the model learns to encode the semantic meaning of words throughout this training phase [125]. Models such as Word2Vec or GloVe create these embeddings using methods like the skip-gram model or co-occurrence matrix factorization [112].
Graph-based semantic analysis.
A text is represented by a weighted directed graph in knowledge graph analysis (KGA), where the edges denote the relationships between the semantic concepts that the text’s words convey, and the nodes represent the semantic concepts themselves [126]. Usually, the relations come from publicly accessible resources like WordNet or BabelNet. The main difficulty with KGA is determining the edge weights. In earlier work, WordNet’s concept relationships were examined to determine edge weights [127,128]. Salvador et al. [129] improved the weighting process by utilizing continuous skip-grams that also take the concepts’ context into account. Semantic similarity scores for documents or portions of phrases are obtained by applying graph similarity metrics.
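As a toy sketch of this idea (the concept graph below is hand-built for illustration; real systems derive nodes and relations from resources such as WordNet or BabelNet and weight the edges rather than treating them uniformly), path length in the graph can serve as a crude relatedness score:

```python
from collections import deque

# Hand-built, unweighted concept graph (adjacency lists).
GRAPH = {
    "car": ["vehicle"],
    "vehicle": ["car", "bicycle", "machine"],
    "bicycle": ["vehicle"],
    "machine": ["vehicle", "computer"],
    "computer": ["machine"],
}

def shortest_path_len(src, dst):
    """Breadth-first search for the number of edges between two concepts."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in GRAPH.get(node, []):
            if nb == dst:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None  # no path: unrelated concepts

def path_similarity(a, b):
    """WordNet-style path similarity: 1 / (1 + shortest path length)."""
    d = shortest_path_len(a, b)
    return 0.0 if d is None else 1 / (1 + d)

print(path_similarity("car", "bicycle"))   # closely related concepts
print(path_similarity("car", "computer"))  # more distant concepts
```

Aggregating such concept-pair scores over two passages yields a graph-based semantic similarity between them.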
Syntax
By using PoS tagging to identify a sentence’s syntactic structure, syntax-based detection techniques normally function at the sentence level. By comparing only word pairs that correspond to the same PoS class, syntactic information might mitigate morphological ambiguity during the lemmatization or stemming step of preprocessing or reduce the workload of future semantic analysis [130]. PoS tag frequency is a stylometric parameter that is used in many detection strategies. Unlike lexical-based approaches, which focus on vocabulary and word usage, syntax-based approaches look at the syntactic structure and arrangement of sentences inside a document [131]. These characteristics can offer insightful information about the structure and content of the text, which can help spot possible plagiarism.
Syntactic.
One important aspect of intrinsic plagiarism detection is syntactic-based feature extraction, which is devoted to identifying the grammatical structure and structural blueprints of text [121]. This method delves into the complex syntax that determines sentence structure inside a document, going beyond simple lexical considerations. A variety of syntactic elements are utilized, including phrase structures and dependency relations. For example, dependency relations record the syntactic interactions among words; they include subject-verb-object relationships and provide information on sentence structure and similarity [80]. Analyzing phrase structures, both verb and noun phrases, helps one comprehend how syntactic alignment works within a sentence. Together, these characteristics make it easier to compare and recognize similar or paraphrased phrases, strengthening the detection of plagiarism that relies on syntactic tricks [132].
POS tagging.
One of the sequence labeling tasks is part-of-speech (POS) tagging, which entails giving each word a grammatical category label based on contextual and linguistic information [133]. A word’s tag or label offers details about the word and the lexical categories that surround it. A POS tagger would typically divide a sentence into several subcategories according to the parts of speech it contains, such as nouns, pronouns, adjectives, verbs, adverbs, and so on. POS tags are useful because they offer linguistic information on how words can be used in a phrase, sentence, or document. POS tagging is an essential preprocessing step for many Natural Language Processing (NLP) frameworks; speech recognition, sentiment analysis, question answering, chunking, Named Entity Recognition (NER), and word sense disambiguation are examples of these NLP frameworks [134]. It can be difficult to determine a word’s grammatical class since it changes according to the context in which the word is employed.
As a result, it can be challenging to tag every word in a phrase when some words have many grammatical POS labels. The issue of POS tagging has been extensively studied in English, several European languages, and most South Asian languages [34]. Research on Indian languages is still necessary, though, especially on the Odia language, since it can be difficult to study languages with complex morphological inflection and variable word order. Additionally, POS tagging in Odia is made more difficult by the absence of capitalization, gender information, and other elements. To determine each input word’s distinct grammatical POS, a POS tagging technique is required. The POS tagging task is approached using a variety of algorithms, including rule-based, probabilistic, deep learning, and hybrid techniques [135,136].
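A minimal dictionary-based tagger (a toy stand-in for the rule-based, probabilistic, and deep learning taggers mentioned above; the lexicon and the default-to-noun fallback are illustrative assumptions of ours) shows how PoS tag frequencies can be derived for use as stylometric features:

```python
from collections import Counter

# Tiny hand-built lexicon; real taggers are trained on annotated corpora.
LEXICON = {
    "the": "DET", "a": "DET", "cow": "NOUN", "moon": "NOUN",
    "jumps": "VERB", "leaps": "VERB", "over": "ADP", "bright": "ADJ",
}

def tag(sentence):
    """Look each token up in the lexicon; unknown words default to NOUN,
    a common most-frequent-tag baseline."""
    return [(w, LEXICON.get(w, "NOUN")) for w in sentence.lower().split()]

def tag_frequencies(sentence):
    """PoS tag counts, usable as one component of a stylometric profile."""
    return Counter(t for _, t in tag(sentence))

print(tag("The cow jumps over the moon"))
print(tag_frequencies("The cow jumps over the moon"))
```

The resulting tag distribution (here two determiners, two nouns, one verb, one adposition) is exactly the kind of frequency vector many intrinsic detection strategies compare across text segments.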
Plagiarism detection techniques
Traditional techniques.
Apart from the widely employed methods in intrinsic plagiarism detection, several conventional approaches have demonstrated potential in this domain.
1. Style Change Function: When detecting suspected plagiarism in a text that contains unreported changes in writing style, stylistic analysis is a crucial component of the process [69]. In [54], Hourrane et al. (2019) approach the intrinsic plagiarism detection problem using several embedding types, applying their architecture to two sub-tasks. The first is style change detection, in which the authors check whether an input document contains sections produced by different authors, as indicated by style changes. The second is style breach detection, in which the authors identify any passages that stylistically depart from the primary writing style.
2. Lucene for Indexing: One popular indexing library that may be used to effectively store and retrieve textual data is Lucene, which can also be used to compare and identify passages that have been copied [137,138].
Statistical techniques.
When detecting intrinsic plagiarism, statistical and distance-based methods are frequently employed to gauge how similar or distinct two texts are. To measure how much linguistic traits, word frequencies, or stylistic patterns coincide or diverge, these methods use a variety of statistical metrics, including hashing, character and n-gram profiles, and frequency distance.
1. Frequency Difference: It is now possible to compare text segments and identify writing style differences thanks to the usage of distance metrics like frequency difference and pq-gram distance [139]. Furthermore, these methods have demonstrated potential in the area of cross-language plagiarism detection, as they have proven successful in determining the degree of textual similarity between languages. A method [140] for identifying text alignment between the suspicious and source documents was presented by El-Rashidy et al. in 2022. Their main contribution is the term frequency-inverse sentence frequency (tf-isf) method, which is used to identify instances of plagiarism in sentences.
2. Lempel-Ziv Compression: With their differing approaches to identifying and evaluating stylistic variance in text documents, style change functions and Lempel-Ziv compression each contribute something new to the discipline. Furthermore, the addition of compression and style-change tools has enabled new methods of identifying segments with distinct writing styles and detecting plagiarism [141].
Distance based techniques.
To identify possible cases of plagiarism, statistical and distance-based methodologies quantify stylistic differences and similarities within a manuscript.
1. Principal Component Analysis (PCA) with Distance Score: An innovative method for intrinsic plagiarism detection is presented in [142] by Veisi et al. in 2022. It suggests creating vectors of character trigram frequencies to represent the successive windows that make up a suspicious document. After that, an altered normalized distance measure is used to create the distance matrix to compare each window with the others.
2. Character N-Grams Profile Method: Bensalem et al. considerably advance the field in their study [143,144] by enhancing the method’s parameters and incorporating larger feature sets. The technique makes use of a dissimilarity metric that was initially created for author identification as well as character n-gram profiles. Furthermore, heuristic principles are put forth to help identify passages that have been copied, find texts that are free of plagiarism, and lessen the effect of unrelated style modifications.
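A simplified sketch of the character n-gram profile idea follows (this dissimilarity is a normalized variant in the spirit of measures used for author identification, not the exact formula from [143,144]; the sample texts are invented):

```python
from collections import Counter

def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (lowercased, spaces kept)."""
    s = text.lower()
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()}

def dissimilarity(profile_a, profile_b):
    """Averaged squared relative frequency differences over the union of
    n-grams: 0 for identical profiles, 1 for fully disjoint profiles."""
    grams = set(profile_a) | set(profile_b)
    total = 0.0
    for g in grams:
        fa, fb = profile_a.get(g, 0.0), profile_b.get(g, 0.0)
        total += (2 * (fa - fb) / (fa + fb)) ** 2
    return total / (4 * len(grams))

doc = char_ngram_profile("the cow jumps over the moon")
window = char_ngram_profile("the cow leaps over the moon")
other = char_ngram_profile("zzzz qqqq xxxx")
# A lightly edited window stays close to the document profile; an unrelated
# window is maximally dissimilar.
print(dissimilarity(doc, window) < dissimilarity(doc, other))
```

In the actual method, a sliding window’s profile is compared against the whole-document profile, and windows whose dissimilarity spikes are flagged as style outliers.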
Methods of supervised machine learning.
These methods include the use of word embeddings in embedding-based approaches, stylistic feature-focused stylometric analysis, and linguistic analysis with n-gram frequencies [145]. Machine learning algorithms facilitate the identification of potentially plagiarised passages within a manuscript as well as changes in writing style. Furthermore, by offering important insights into identifying and avoiding plagiarism in textual content, these ML algorithms significantly improve the precision and efficacy of intrinsic plagiarism detection systems [138].
1. Support Vector Machine: Three stages comprise the implementation of the system proposed [19] by El-Rashidy et al. (2024). First, preprocessing techniques are applied: part-of-speech labelling, lemmatization, lower casing, stop-word removal, punctuation removal, elimination of all tokens that don’t begin with a letter, sentence segmentation, paragraph composition, and tokenization. Second, the training database is created, the support vector machine model is constructed, lexical, syntactic, and semantic features are computed, the collection of potentially plagiarised cases is extracted, and the most valuable features are selected using the seeding technique [71]. The sentence similarity cases are found along two paths: the first is based on a traditional paragraph-level comparison, and the second on the hyperplane equation of the constructed SVM classifier. In the last stage, approaches such as adaptive behavior, filter segments, filter seeds, and merging adjacent identified seeds are used to extract the best-matching plagiarised segment between suspicious and source texts.
2. Decision Tree: Decision Trees (DTs) are predictive models well known in supervised learning for their resilience, interpretability, and undeniable usefulness across a broad spectrum of applications. Addressing the three fundamental objectives of a predictive learner (fitting training data, generalization, and interpretability), this study [146] by Costa and Pedreira (2023) offers an overview of the most important recent developments in DT research. Typically, these widely used learning models are shown as a flowchart-like structure, where each internal node represents a split or logical test and each leaf represents a prediction.
3. K Nearest Neighbor (KNN): On the smaller dataset, KNN, SVM, and DT were the classification techniques employed by Eppa and Murali (2022) in the study [147]. The next sections go into detail about how these algorithms are implemented as well as the accuracy that can be achieved with them. The seven closest neighbors and the weights allocated depending on distance were taken into account when implementing the K Nearest Neighbours algorithm [148]. The algorithm used for the Decision Trees Classifier was based on the Gini Index.
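A minimal distance-weighted KNN, mirroring the setup described above (seven neighbours, weights inversely proportional to distance; the 2-D feature vectors and labels below are invented for illustration, e.g. two stylometric features per passage):

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=7):
    """Distance-weighted k-nearest-neighbour vote (weight = 1/distance)."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 if d == 0 else 1.0 / d
    return max(votes, key=votes.get)

# Toy training set: feature vectors labelled by a hypothetical annotator.
train = [((0.10, 0.20), "original"),    ((0.20, 0.10), "original"),
         ((0.15, 0.25), "original"),    ((0.90, 0.80), "plagiarised"),
         ((0.80, 0.90), "plagiarised"), ((0.85, 0.85), "plagiarised"),
         ((0.95, 0.75), "plagiarised")]

print(knn_predict(train, (0.12, 0.18), k=7))  # votes favour "original"
```

Even with all seven neighbours included, the inverse-distance weighting lets the three nearby “original” points outvote the four distant “plagiarised” ones.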
4. Deep Learning: Plagiarism detection can be accomplished using a variety of deep learning methods. A few of them are enumerated below:
1. (a) Convolutional Neural Network (CNN): Utilizing CNNs with the Adaptive Moment Estimation (ADAM), Stochastic Gradient Descent (SGD), and Root Mean Square Propagation (RMSProp) optimizers, the authors of [149] present a novel approach for authorship verification in Urdu that closes this gap. To support the development of this approach, they assembled a new corpus, called UAVC-22, intended for Urdu authorship verification; it offers enhanced robustness with respect to authors and distinct word classes. Three different text embedding techniques are used: Word2Vec, GloVe, and FastText. Nine authorship verification models were developed in this study [149]. The CNN-based method [150,151] was compared to more traditional machine learning models like SVM and RF to assess its efficacy and superiority; for the Urdu dataset UAVC-22, the CNN-ADAM model optimized with FastText achieved the highest accuracy of 98%.
2. (b) Long Short-term Memory (LSTM): In low-resource text plagiarism detection, the limited amount of labeled data available for training presents a considerable barrier. This endeavor requires sophisticated algorithms that can recognize patterns and distinctions in texts, especially in translation-based plagiarism detection and semantic rewriting. In this work [152], the authors present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network [115] designed to detect Tibetan-Chinese plagiarism. They first introduce translation-based data augmentation to expand the bilingual training data set. Subsequently, they offer a pre-detection approach that leverages abstract document vectors to enhance detection efficiency. Lastly, they present an enhanced Siamese LSTM network designed for detecting plagiarism in Tibetan and Chinese, with extensive tests demonstrating the efficacy of the proposed methodology.
3. (c) Bidirectional Long Short-Term Memory Network (BiLSTM): The recommended information extraction method locates text lines within a given document by utilizing deep learning classification techniques that characterize the metadata of the proposed algorithmic PC. The authors used two BiDirectional LSTM architectures [153] and a character convolutional neural network to categorize each sentence in an article as either an algorithmic metadata class or a class unrelated to algorithms. Even though word embedding-based techniques are appropriate for domain-specific tasks, structural parsers still need to manage a variety of issues, including morphological shifts and indeterminate segmentation, which makes obtaining prior knowledge costly. Text comprehension is largely tailored to a single language; numerous rules need to be created from scratch if the language changes.
Methods for unsupervised machine learning.
These methods seek to locate groups of writers that share a common writing style and find any discrepancies that might point to possible plagiarism.
1. K-Means: Only pertinent documents can be selected for detection by clustering the documents with an unsupervised machine learning method such as K-means [154]. Using the TF-IDF text encoding strategy, the degree of similarity between the suspicious article and the corpus of source articles is calculated with NLP, K-means clustering, and cosine similarity approaches [155].
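A bare-bones K-means (Lloyd’s algorithm) over toy 2-D document vectors illustrates the clustering step (the points, k, and iteration count are illustrative choices; real pipelines cluster TF-IDF vectors):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns final centroids and point labels."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Two well-separated groups of toy document vectors.
points = [(0.10, 0.10), (0.20, 0.15), (0.15, 0.20),
          (0.90, 0.90), (0.80, 0.95), (0.95, 0.85)]
centroids, labels = kmeans(points, k=2)
print(labels)  # first three points share one cluster, last three the other
```

Only the cluster containing the suspicious document’s vector then needs pairwise comparison, which cuts the candidate retrieval cost.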
2. Agglomerative Hierarchical: The development of AI technology has made content creation simpler and more widely available, making it challenging to distinguish between text produced by AI and text produced by humans. To address this concern, the authors propose an intelligent system that uses stylometric analysis to identify distinctive writing styles in text files. Using silhouette scores as performance indicators, the paper also examines several clustering methods [156], such as k-means, k-means++, hierarchical, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This ensures the system is effective in differentiating between similar and dissimilar writing styles based on sophisticated linguistic and structural aspects of the text. The system groups together text of similar style and separates the rest, offering a useful way to detect plagiarism across several document files [157].
3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Based on how it is detected, plagiarism can be divided into two main categories: intrinsic and extrinsic. In contrast to intrinsic plagiarism detection, which uses writing-style variance to identify plagiarism without a reference corpus, extrinsic plagiarism detection compares a text to a predetermined reference dataset. While there are numerous methods for identifying extrinsic plagiarism, there are few for identifying intrinsic plagiarism. This work [158] by Saini et al. (2021) presents a streamlined methodology for creating a plagiarism detector that can effectively identify instances of plagiarism even in the absence of a reference corpus. The method focuses on creating a plagiarism detection system by employing DBSCAN clustering [159] and stylometric features to determine the authors’ writing styles within the article. The user-friendly interactive interface of the proposed system allows the user to upload a text document to be examined for plagiarism, and the results are shown directly on the web page. Furthermore, the user has access to graphs that represent the document’s analysis.
Evaluation techniques
Researchers in NLP and Information Retrieval (IR) need access to datasets for development and assessment. The PAN datasets provide a thorough and reputable benchmark for comparing plagiarism detection systems and approaches [67]. The PAN test datasets include synthetic examples of cross-language plagiarism and artificially generated monolingual instances with varying degrees of obfuscation. The majority of the studies in this survey that describe algorithms for lexical, syntactic, and semantic detection make use of the Microsoft Research Paraphrase corpus or PAN datasets. Since detecting plagiarism involves retrieving information, approaches for evaluating plagiarism detection are commonly based on precision, recall, and F-measure [67]. While language dependence affects the diversity of detection targets, abstraction level affects both detection accuracy and generality. Efficiency determines how quickly and with what resources detection runs, while extensibility guarantees applicability across a range of project sizes and contexts. These elements are essential to software development and maintenance because they determine the quality and usefulness of code similarity detection [160].
A variety of criteria that offer information about these algorithms’ performance are used to evaluate intrinsic plagiarism detection techniques [161]. These metrics include clustering evaluation criteria, classification-based measurements, and metrics created especially for plagiarism detection tasks [19]. The following classification measures are frequently used: F1 Score, Accuracy, Precision, and Recall. Precision calculates the percentage of true positives among predicted positives, whereas Accuracy measures the overall correctness of predictions. For imbalanced datasets, the F1 Score provides a balance between recall and precision [80]. A statistic called granularity is used in clustering techniques to assess how detailed the clustering results are; it reflects how finely grained the clusters are. Furthermore, an evaluation criterion called Plagdet (short for Plagiarism Detection) assesses how well techniques perform at plagiarism detection. In addition, metrics designed specifically for comparing clustering partitions include WindowDiff, WindowP, WindowR, and WindowF. Finally, a statistical measure called mean distance calculates the typical separation between data points inside clusters [28,69].
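The classification measures above can be made concrete with a few lines of Python (the labels are invented for illustration; 1 marks a passage annotated as plagiarised, and the detector’s predictions contain one false positive and one false negative):

```python
def precision_recall_f1(true_labels, predicted):
    """Binary classification metrics, with label 1 = plagiarised."""
    tp = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_labels, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

true_labels = [1, 1, 1, 0, 0, 0]   # 3 plagiarised passages out of 6
predicted   = [1, 1, 0, 1, 0, 0]   # detector misses one, falsely flags one
p, r, f1 = precision_recall_f1(true_labels, predicted)
print(p, r, f1)  # each equals 2/3 here, since FP and FN counts coincide
```

PAN’s Plagdet score additionally folds granularity into the F-measure, penalizing detectors that report one true case as many fragmented detections.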
Findings and trends
Due to the abundance of material available on the internet and the strength of search engines, plagiarism is becoming a severe issue in many fields, including education. Usually, plagiarism is separated into two types: intentional and inadvertent. However, techniques for detecting plagiarism can also be applied in other domains, such as information retrieval, where a text is input and the most appropriate matches are identified. Systems for detecting plagiarism are useful not just in the sectors of education and information retrieval, but also in publishing, research, and litigation. In publication as well as research, it is essential to guarantee the authenticity and uniqueness of published work. Plagiarism detection technology can prevent academic misconduct and maintain the standard of published writing by identifying instances of repetition or similarity with previously published works. In legal proceedings, plagiarism detection can be used to identify instances of intellectual property theft or copyright infringement.
Over time, plagiarism detection algorithms have undergone significant evolution, utilizing various approaches to address the difficulties of detecting stylistic modifications and possible instances of plagiarism in a given work. Early research focused on distinguishing original from plagiarised content by quantifying stylistic features [118] and employing traditional discriminant analysis [122,138]. The character n-gram profiles approach was developed [109,110], using character-based n-gram features for detection. PCA with distance scores [142] employed principal component analysis to identify stylistic variations according to distance metrics. The use of Lempel-Ziv compression [80,141], which exploits compression algorithms to identify inconsistencies and stylistic shifts, represents a further advancement, as does the idea of style change functions [54,69], put forth to detect changes in a document’s writing style. To organize related text segments according to stylistic similarities, clustering algorithms including Agglomerative Hierarchical clustering [156,157] and K-Means clustering [154,155] were used. Subsequent methods like Transformer models [18], Decision Trees [146], and Support Vector Machines [19,71] contributed significantly to the field. Ensemble learning with Random Forest, AdaBoost, Multilayer Perceptron (MLP), and LightGBM [162] demonstrated the advantages of mixing several models for better results. LSTM + BERT [21] and GAN-based encoder-decoder architectures [163] are examples of recent developments.
[Figure omitted. See PDF.]
In recent times, there has been a noticeable increase in the performance of large language models (LLMs) on several tasks [164]. The increasing prevalence of plagiarism in academic writing is a significant concern linked to the increasing reliance on ChatGPT [11], which might potentially jeopardize the objectivity and integrity of assignments and tests. Various AI-generated text classifiers and tools, like Log Likelihood [165], RoBERTa-QA (HC3) [44], GPTZero [11], OpenAI Classifier [14], DetectGPT [58], and Turnitin [166], along with Plagiarisma, Plagiarismdetect, Duplichecker, Grammarly, PlagAware, Quetext, and PlagScan [28], have been developed by researchers to lessen the potential for plagiarism resulting from the use of LLMs. To enhance comprehension, a list of abbreviations is provided in Table 3.
Challenges in plagiarism detection
We have examined several papers on plagiarism identification [19,69,139,142]. There is a dearth of research on the detection of plagiarism in tables and figures in natural language, and the technologies now in use are unable to identify plagiarised images, tables, figures, formulas, and scanned papers. The security and privacy of the technologies pose an additional difficulty: certain tools save the documents that users provide in their repositories. One well-known commercial tool that stores student papers and assignments in its database for potential future plagiarism detection is Turnitin; this practice of retaining submissions is regarded as an unlawful activity [28].
1. The lack of a reference list or reliable sources for comparison is one major obstacle. It is challenging to differentiate between cases of plagiarism and true stylistic modifications due to the absence of external references [116].
2. The diversity and intricacy of writing styles present another difficulty. Different writing styles can be used by authors either purposefully or accidentally, which can result in variances in their work [119].
3. The other difficulty in intrinsic plagiarism detection is identifying the best attributes and representing them. The writing style has been studied using a variety of criteria, including word frequencies, grammatical structures, character n-grams, and semantic patterns [28].
4. Plagiarised passages that are scattered or broken up across a manuscript can be considered instances of intrinsic plagiarism. These partial plagiarism cases, in which just particular sentences or phrases are replicated, call for sophisticated algorithms capable of detecting minute variations and parallels in the textual content of the document [150].
5. Because plagiarism writers might use complex techniques to hide or modify their writing styles, it becomes more challenging to identify cases of plagiarism based only on inherent traits [80].
6. Another issue is the absence of linguistic heterogeneity and diversity among languages with limited resources. The variety of dialects, registers, and writing styles seen in high-resource languages adds to a language’s complexity and diversity [90].
7. Plagiarism detection in low-resource languages is further complicated by the lack of language-specific stylometric cues. Writing style is mostly captured and quantified by stylistic elements including word frequency, n-grams, and grammatical patterns.
8. The scarcity of corpora and reference materials for languages with little resources is another issue. For referencing and contrasting, scholars working in high-resource languages, have access to extensive databases of texts, books, and internet resources. These resources, however, could be hard to come by, insufficient, or difficult to access digitally in languages with limited resources [167].
9. Report generation and processing take a lot of time, which presents difficult issues. The larger the document, the longer it takes and the more bandwidth it requires. Because of its extensive feature set and large user base, Turnitin is regarded as one of the best plagiarism detection tools [28].
10. Significant obstacles in the detection of plagiarism also stem from technical limitations and restrictions. Although they don’t need to be downloaded and installed on a user’s computer, web-based tools still need a fast internet connection. A finite number of documents with a finite file size can be handled at once using the current tools. They take a long time to review a large number of documents [28].
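Several of the challenges above concern representing writing style through features such as character n-grams (items 3 and 7) and locating scattered suspicious passages (item 4). The following sketch illustrates one common intrinsic approach under stated assumptions: character n-gram frequency profiles compared with a Stamatatos-style normalized dissimilarity, applied to sliding windows over a document. The function names, window size, and threshold are illustrative choices, not taken from any specific tool surveyed here.

```python
# Minimal sketch of intrinsic plagiarism detection via character n-gram
# style profiles. All names and thresholds below are illustrative
# assumptions, not the method of any particular tool in this survey.
from collections import Counter

def ngram_profile(text, n=3):
    """Relative character n-gram frequencies of a text span."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_distance(p, q):
    """Normalized style dissimilarity in [0, 1]; 0 means identical."""
    keys = set(p) | set(q)
    if not keys:
        return 0.0
    # Squared relative frequency differences over the union of n-grams,
    # scaled so a fully disjoint pair of profiles scores 1.
    return sum(
        (2 * (p.get(k, 0) - q.get(k, 0)) / (p.get(k, 0) + q.get(k, 0))) ** 2
        for k in keys
    ) / (4 * len(keys))

def suspicious_windows(document, window=500, step=250, threshold=0.35):
    """Flag windows whose style deviates from the whole-document profile."""
    doc_profile = ngram_profile(document)
    flagged = []
    for start in range(0, max(1, len(document) - window + 1), step):
        span = document[start:start + window]
        d = profile_distance(doc_profile, ngram_profile(span))
        if d > threshold:
            flagged.append((start, start + window, round(d, 3)))
    return flagged
```

Windows written by the document's main author should sit close to the whole-document profile, while inserted passages in a foreign style stand out as outliers; the threshold trades precision against recall and would need tuning per language and corpus.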
Conclusion
Detecting plagiarism is a crucial area of text analysis that searches a document for duplicated content and determines whether a given passage was written by the same author as the rest of the text. With the public availability of content creation systems based on large language models, such as ChatGPT, the issue of intrinsic plagiarism has gained significance across many sectors. Because computers are routinely used in classrooms and electronic information on the internet is widely accessible, student plagiarism is on the rise. Coping with this shifting environment requires increasingly accurate and dependable detection techniques. This study examines the efficacy of various plagiarism detection methods and compares their ability to discern between content produced by artificial intelligence (AI) and content written by humans. To give readers an overall picture of the state of research on computational techniques for plagiarism detection, this article thoroughly assesses 189 research papers published between 2019 and 2024. To organize the presentation of the research contributions, we propose a new, technically focused framework covering efforts to prevent and identify plagiarism, types of academic plagiarism, and computational techniques for detecting plagiarism. We show that plagiarism detection is the subject of a wealth of active research. Over the period reviewed, considerable progress has been made in automatically recognizing plagiarism that is heavily disguised and therefore challenging to detect. The key forces behind these advances are the exploration of features of non-textual content, the use of machine learning, and improved techniques for semantic text analysis.
Our analysis leads us to conclude that combining multiple analytical methodologies for textual and non-textual content features is the most promising direction for future research to further enhance plagiarism detection.
Future research directions
As technology advances, we recommend that future work exploit graph-based structures, images, references, and citations to further improve the detection process. We plan to combine graph-based structures with deep learning and graph clustering approaches to identify plagiarism.
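The graph-based direction described above can be sketched in miniature: sentences become nodes, edges connect pairs with high lexical overlap, and a clustering step groups mutually similar sentences that may share a source. The sketch below is a deliberately simple stand-in, with Jaccard word overlap for the edge weight and connected components in place of a real graph clustering algorithm; all names and the threshold are illustrative assumptions.

```python
# Toy illustration of graph-based similarity grouping for plagiarism
# analysis. Jaccard overlap and connected components are simple
# placeholders for the richer graph clustering envisioned in the text.
from itertools import combinations

def jaccard(a, b):
    """Word-overlap similarity of two sentences, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def similarity_graph(sentences, threshold=0.5):
    """Adjacency sets linking highly overlapping sentence pairs."""
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if jaccard(sentences[i], sentences[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def clusters(graph):
    """Connected components: groups of mutually similar sentences."""
    seen, out = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        out.append(comp)
    return out
```

In a full system, the edge weights would come from semantic embeddings rather than raw word overlap, and clusters spanning two documents (or stylistically divergent regions of one document) would be flagged for review.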
References
1. Chaka C. Reviewing the performance of AI detection tools in differentiating between AI-generated and human-written texts: A literature and integrative hybrid review. J Appl Learn Teach. 2024;7(1).
2. Ibrahim K. Using AI-based detectors to control AI-assisted plagiarism in ESL writing: "The Terminator Versus the Machines". Lang Test Asia. 2023;13(1):46.
3. Heumann M, Kraschewski T, Breitner MH. ChatGPT and GPTZero in research and social media: A sentiment- and topic-based analysis. In: Twenty-ninth Americas conference on information systems, Panama; 2023.
4. Xie Y, Wu S, Chakravarty S. AI meets AI: Artificial intelligence and academic integrity-A survey on mitigating AI-assisted cheating in computing education. In: Proceedings of the 24th annual conference on information technology education; 2023. p. 79–83.
5. Ansari M, Pandey D, Alenezi M. STORE: Security threat oriented requirements engineering methodology. J King Saud Univ-Comput Inform Sci. 2022;34(2):191–203.
6. Ansari M, Baz A, Alhakami H, Alhakami W, Kumar R, Khan R. P-STORE: Extension of STORE methodology to elicit privacy requirements. Arab J Sci Eng. 2021;46:8287–310.
7. Elkhatat A, Elsaid K, Almeer S. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. Int J Educ Integr. 2023;19(1):17.
8. Crawford J, Cowling M, Allen K. Leadership is needed for ethical ChatGPT: Character, assessment, and learning using artificial intelligence (AI). J Univ Teach Learn Pract. 2023;20(3):02.
9. King MR. ChatGPT. A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cell Mol Bioeng. 2023;16(1):1–2. pmid:36660590
10. Dwivedi Y, Kshetri N, Hughes L, Slade E, Jeyaraj A, Kar A. So what if ChatGPT wrote it? Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inform Manage. 2023;71:102642.
11. Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection. In: Proceedings of the international conference on human-computer interaction. Springer; 2023.
12. Altheneyan A, Menai M. Evaluation of state-of-the-art paraphrase identification and its application to automatic plagiarism detection. Int J Pattern Recogn Artif Intell. 2020;34(04):2053004.
13. Alhakami W, Binmahfoudh A, Baz A, Alhakami H, Ansari M, Khan R. Atrocious impinging of COVID-19 pandemic on software development industries. Comput Syst Sci Eng. 2021;36(2):323–38.
14. Meuschke N. New AI classifier for indicating AI-written text. OpenAI. Springer; 2023.
15. Mindner L, Schlippe T, Schaaff K. Classification of human- and AI-generated texts: Investigating features for ChatGPT. In: Proceedings of the international conference on artificial intelligence in education technology; 2023. p. 152–70.
16. Creo A, Pudasaini S. Evading AI-generated content detectors using homoglyphs. arXiv preprint arXiv:2406.11239. 2024.
17. Kirchner. Analyzing non-textual content elements to detect academic plagiarism. OpenAI; 2023.
18. Emi B, Spero M. Technical report on the Checkfor.ai AI-generated text classifier. arXiv preprint arXiv:2402.14873. 2024.
19. El-Rashidy M, Mohamed R, El-Fishawy N, Shouman M. An effective text plagiarism detection system based on feature selection and SVM techniques. Multimedia Tools Applic. 2024;83(1):2609–46.
20. Baishya K. Plagiarism detection software: An overview. Res Publ Ethics. 2024:281.
21. Xiong J, Yang J, Yan L, Awais M, Khan AA, Alizadehsani R. Efficient reinforcement learning-based method for plagiarism detection boosted by a population-based algorithm for pretraining weights. Expert Syst Applic. 2024;238:122088.
22. Abbaszadeh Shahri A, Shan C, Larsson S, Johansson F. Normalizing large scale sensor-based MWD data: An automated method toward a unified database. Sensors (Basel). 2024;24(4):1209. pmid:38400367
23. Oloo V, Otieno C, Wanzare L. A literature survey on writing style change detection based on machine learning: State-of-the-art review. Int J Comput Trends Technol. 2022;70(5):15–32.
24. Alzahrani S, Aljuaid H. Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases. J King Saud Univ-Comput Inform Sci. 2022;34(4):1110–23.
25. Ishaq M, Abid A, Farooq MS, Manzoor MF, Farooq U, Abid K, et al. Advances in database systems education: Methods, tools, curricula, and way forward. Educ Inf Technol (Dordr). 2023;28(3):2681–725. pmid:36061104
26. Farooq U, Rahim M, Sabir N, Hussain A, Abid A. Advances in machine translation for sign language: Approaches, limitations, and challenges. Neural Comput Applic. 2021;33(21):14357–99.
27. Ramzan M, Abid A, Khan H, Awan S, Ismail A, Ahmed M, et al. A review on state-of-the-art violence detection techniques. IEEE Access. 2019;7:107560–75.
28. Jiffriya M, Jahan M, Ragel R. Plagiarism detection tools and techniques: A comprehensive survey. J Sci-FAS-SEUSL. 2021;2(02):47–64.
29. Tehseen R, Farooq MS, Abid A. Earthquake prediction using expert systems: A systematic mapping study. Sustainability. 2020;12(6):2420.
30. Zellers R, Holtzman A, Rashkin H, Bisk Y, Farhadi A, Roesner F, et al. Defending against neural fake news. Adv Neural Inform Process Syst. 2019;32.
31. Gehrmann S, Strobelt H, Rush A. GLTR: Statistical detection and visualization of generated text. arXiv preprint. 2019.
32. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog. 2019;1(8):9.
33. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint. 2019.
34. Dalai T, Mishra T, Sa P. Deep learning-based POS tagger and chunker for Odia language using pre-trained transformers. ACM Trans Asian Low-Resour Lang Inform Process. 2024;23(2):1–23.
35. Adelani D, Mai H, Fang F, Nguyen H, Yamagishi J, Echizen I. Generating sentiment-preserving fake online reviews using neural language models and their human- and machine-based detection. In: Advanced information networking and applications: Proceedings of the 34th international conference on advanced information networking and applications (AINA-2020); 2020.
36. Uchendu A, Le T, Shu K, Lee D. Authorship attribution for neural text generation. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP); 2020. p. 8384–95. https://doi.org/10.18653/v1/2020.emnlp-main.673
37. Keskar NS, McCann B, Varshney LR, Xiong C, Socher R. CTRL: A conditional transformer language model for controllable generation. arXiv preprint. 2019.
38. Lample G, Conneau A. Cross-lingual language model pretraining. arXiv preprint. 2019.
39. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: Generalized autoregressive pretraining for language understanding. Adv Neural Inform Process Syst. 2019;32.
40. Dathathri S, Madotto A, Lan J, Hung J, Frank E, Molino P. Plug and play language models: A simple approach to controlled text generation. arXiv preprint. 2019.
41. Fagni T, Falchi F, Gambini M, Martella A, Tesconi M. TweepFake: About detecting deepfake tweets. PLoS One. 2021;16(5):e0251415. pmid:33984021
42. Harrag F, Debbah M, Darwish K, Abdelali A. BERT transformer model for detecting Arabic GPT2 auto-generated tweets. arXiv preprint. 2021.
43. Jawahar G, Abdul-Mageed M, Lakshmanan L. Automatic detection of entity manipulated text using factual knowledge. arXiv preprint. 2022.
44. Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint. 2023.
45. Fan A, Jernite Y, Perez E, Grangier D, Weston J, Auli M. ELI5: Long form question answering. arXiv preprint. 2019.
46. Kirchenbauer J, Geiping J, Wen Y, Shu M, Saifullah K, Kong K. On the reliability of watermarks for large language models. arXiv preprint. 2023.
47. Fernandez P, Chaffin A, Tit K, Chappelier V, Furon T. Three bricks to consolidate watermarks for large language models. In: 2023 IEEE international workshop on information forensics and security (WIFS); 2023.
48. Christ M, Gunn S, Zamir O. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194. 2023.
49. Khaled F, Al-Tamimi MSH. Plagiarism detection methods and tools: An overview. Iraqi J Sci. 2021. p. 2771–83.
50. Fuad A, Wicaksono A, Aqib M, Khoiruddin M, Fajar A, Mustamir K. AI hybrid based plagiarism detection system creation. In: Proceedings of the 4th international conference on advance computing and innovative technologies in engineering (ICACITE); 2024.
51. Pudasaini S, Miralles-Pechuán L, Lillis D, Llorens Salvador M. Survey on AI-generated plagiarism detection: The impact of large language models on academic integrity. J Acad Ethics. 2024:1–34.
52. Izi AN, Anggraini FN, Regita R, Rabiatuladawiyah R. A development of the Turnitin system in improving plagiarism detection for Islamic religious education studies. Suhuf. 2024;36(2).
53. Nketsiah I, Imoro O, Barfi KA. Postgraduate students' perception of plagiarism, awareness, and use of Turnitin text-matching software. Acc Res. 2024;31(7):786–802. pmid:36693789
54. Hourrane O. Rich style embedding for intrinsic plagiarism detection. Int J Adv Comput Sci Applic. 2019;10(11).
55. Mukhtar N, Khan M. Effective lexicon-based approach for Urdu sentiment analysis. Artif Intell Rev. 2020;53(4):2521–48.
56. Khonji M, Iraqi Y, Mekouar L. Authorship identification of electronic texts. IEEE Access. 2021;9:101124–46.
57. Quidwai MA, Li C, Dube P. Beyond black box AI-generated plagiarism detection: From sentence to document level. arXiv preprint arXiv:2306.08122. 2023.
58. Mitchell E, Lee Y, Khazatsky A, Manning C, Finn C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In: Proceedings of the international conference on machine learning; 2023.
59. Alshammari H, El-Sayed A, Elleithy K. AI-generated text detector for Arabic language using encoder-based transformer architecture. Big Data Cogn Comput. 2024;8(3):32.
60. Widyassari A, Rustad S, Shidik G, Noersasongko E, Syukur A, Affandy A, et al. Review of automatic text summarization techniques & methods. J King Saud Univ-Comput Inform Sci. 2022;34(4):1029–46.
61. Iyer A, Vosoughi S. Style change detection using BERT. CLEF (Working Notes). 2020;93:106.
62. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
63. Pérez J, Díaz J, Garcia-Martin J, Tabuenca B. Systematic literature reviews in software engineering—Enhancement of the study selection process using Cohen's kappa statistic. J Syst Softw. 2020;168:110657.
64. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. pmid:843571
65. Chaka C. Fourth industrial revolution—A review of applications, prospects, and challenges for artificial intelligence, robotics and blockchain in higher education. Res Pract Technol Enhanced Learn. 2023;18:002.
66. Chaka C. Is Education 4.0 a sufficient innovative, and disruptive educational trend to promote sustainable open education for higher education institutions? A review of literature trends. Front Educ. 2022;7:824976.
67. Bischoff S, Deckers N, Schliebs M, Thies B, Hagen M, Stamatatos E. The importance of suppressing domain style in authorship analysis. CoRR. 2020.
68. Yanaka H, Mineshima K, Bekki D, Inui K, Sekine S, Abzianidze L. Can neural networks understand monotonicity reasoning? arXiv preprint arXiv:1906.06448. 2019. https://doi.org/10.48550/arXiv.1906.06448
69. Alsallal M, Iqbal R, Amin S, James A. Intrinsic plagiarism detection using latent semantic indexing and stylometry. In: 2013 sixth international conference on developments in eSystems engineering; 2013.
70. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision; 2015.
71. AlSallal M, Iqbal R, Palade V, Amin S, Chang V. An integrated approach for intrinsic plagiarism detection. Fut Gen Comput Syst. 2019;96:700–12.
72. Tian J, Lan M. ECNU at SemEval-2016 Task 1: Leveraging word embedding from macro and micro views to boost performance for semantic textual similarity. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016); 2016.
73. Li X, Li J. Angle-optimized text embeddings. arXiv preprint. 2023.
74. Latif S, Qayyum A, Usman M, Qadir J. Cross lingual speech emotion recognition: Urdu vs. Western languages. In: 2018 international conference on frontiers of information technology (FIT); 2018. p. 88–93. https://doi.org/10.1109/fit.2018.00023
75. Datahub. InAra plagiarism detection corpus. Datahub; 2013.
76. Bensalem I, Rosso P, Chikhi S. Intrinsic plagiarism detection using n-gram classes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014.
77. Haseeb M, Manzoor MF, Farooq MS, Farooq U, Abid A. A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu. Data Brief. 2023;52:109857. pmid:38161660
78. Vasuteja A, Reddy AV, Pravin A. Beyond copy paste: Plagiarism detection using machine learning. In: 2024 international conference on inventive computation technologies (ICICT); 2024. p. 245–51. https://doi.org/10.1109/icict60155.2024.10544470
79. Patil R, Kadam V, Nakate R, Kadam S, Pattade S, Mitkari M. A novel natural language processing based model for plagiarism detection. In: 2024 international conference on emerging smart computing and informatics (ESCI); 2024.
80. Manzoor M, Farooq M, Haseeb M, Farooq U, Khalid S, Abid A. Exploring the landscape of intrinsic plagiarism detection: Benchmarks, techniques, evolution, and challenges. IEEE Access. 2023;11:140519–45.
81. Albahra S, Gorbett T, Robertson S, D'Aleo G, Kumar SVS, Ockunzzi S, et al. Artificial intelligence and machine learning overview in pathology & laboratory medicine: A general review of data preprocessing and basic supervised concepts. Semin Diagn Pathol. 2023;40(2):71–87. pmid:36870825
82. Mallikharjuna Rao K, Saikrishna G, Supriya K. Data preprocessing techniques: Emergence and selection towards machine learning models-a practical review using HPA dataset. Multimedia Tools Applic. 2023;82(24):37177–96.
83. Siino M, Tinnirello I, La Cascia M. Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on transformers and traditional classifiers. Inform Syst. 2024;121:102342.
84. Rahimi Z, Homayounpour M. The impact of preprocessing on word embedding quality: A comparative study. Lang Resour Eval. 2023;57(1):257–91.
85. Chai C. Comparison of text preprocessing methods. Nat Lang Eng. 2023;29(3):509–53.
86. Babanejad N, Davoudi H, Agrawal A, An A, Papagelis M. The role of preprocessing for word representation learning in affective tasks. IEEE Trans Affect Comput. 2023;15(1):254–72.
87. Toraman C, Yilmaz EH, Şahinuç F, Ozcelik O. Impact of tokenization on language models: An analysis for Turkish. ACM Trans Asian Low-Resour Lang Inf Process. 2023;22(4):1–21.
88. Korablev Y, Loseva D, Lonchina A. Methods for preprocessing and classification of text data in question-answer information systems. In: 2024 international conference on information processes and systems development and quality assurance (IPS); 2024. p. 27–32.
89. Nazir S, Asif M, Rehman M, Ahmad S. Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language. PeerJ Comput Sci. 2024;10:e1704. pmid:39669469
90. Archana S, Prakash J. An effective undersampling method for biomedical named entity recognition using machine learning. Evol Syst. 2024. p. 1–9.
91. Chavan T, Patil S. Named entity recognition (NER) for news articles. Development (IJAIRD). 2024;2(1):103–12.
92. Savci P, Das B. Structured named entity recognition (NER) in biomedical texts using pre-trained language models. In: 2024 12th international symposium on digital forensics and security (ISDFS); 2024.
93. Frank E, Oluwaseyi J, Olaoye G. Data preprocessing techniques for NLP in BI. Springer; 2024.
94. Nafea A, Muayad M, Majeed R, Ali A, Bashaddadh O, Khalaf M. A brief review on preprocessing text in Arabic language dataset: Techniques and challenges. Babylonian J Artif Intell. 2024;2024:46–53.
95. Bharti SK, Gupta RK, Patel S, Shah M. Context-based bigram model for POS tagging in Hindi: A heuristic approach. Ann Data Sci. 2024;11(1):347–78.
96. Wikacek M, Rybak P, Pszenny L, Wroblewska A. NLPre: A revised approach towards language-centric benchmarking of natural language preprocessing systems. arXiv preprint arXiv:2403.04507. 2024.
97. Mounica B, Lavanya K. Feature selection method on Twitter dataset with part-of-speech (PoS) pattern applied to traffic analysis. Int J Syst Assur Eng Manag. 2024;15(1):110–23.
98. Wei C, Pang R, Kuo CCJ. GWPT: A green word-embedding-based POS tagger. arXiv preprint arXiv:2401.07475. 2024.
99. Boukhlif M, Hanine M, Kharmoum N, Noriega A, Obeso D, Ashraf I. Natural language processing-based software testing: A systematic literature review. IEEE Access. 2024.
100. Bozyigit F, Bardakci T, Khalilipour A, Challenger M, Ramackers G, Babur O. Generating domain models from natural language text using NLP: A benchmark dataset and experimental comparison of tools. Softw Syst Model. 2024:1–19.
101. Kutsenok L, Korablev Y. Research of applicability of natural language processing models to the task of analyzing technical tasks and specifications for software development. In: 2024 XXVII international conference on soft computing and measurements (SCM); 2024. p. 200–3.
102. Bourahouat G, Abourezq M, Daoudi N. Word embedding as a semantic feature extraction technique in Arabic natural language processing: An overview. Int Arab J Inf Technol. 2024;21(2):313–25.
103. Gorai J, Shaw DK. Semantic difference-based feature extraction technique for fake news detection. J Supercomput. 2024. p. 1–23.
104. Tavabi N, Singh M, Pruneski J, Kiapour AM. Systematic evaluation of common natural language processing techniques to codify clinical notes. PLoS One. 2024;19(3):e0298892. pmid:38451905
105. Gupta A, Chadha A, Tewari V. A natural language processing model on BERT and YAKE technique for keyword extraction on sustainability reports. IEEE Access. 2024.
106. Xie R, Ahia O, Tsvetkov Y, Anastasopoulos A. Extracting lexical features from dialects via interpretable dialect classifiers. arXiv preprint arXiv:2402.17914. 2024.
107. Alfreihat M, Almousa O, Tashtoush Y, AlSobeh A, Mansour K, Migdady H. Emo-SL framework: Emoji sentiment lexicon using text-based features and machine learning for sentiment analysis. IEEE Access. 2024.
108. Ahanin Z, Ismail MA, Singh NSS, AL-Ashmori A. Hybrid feature extraction for multi-label emotion classification in English text messages. Sustainability. 2023;15(16):12539.
109. Ksieniewicz P, Zyblewski P, Borek-Marciniec W, Kozik R, Choraś M, Woźniak M. Alphabet flatting as a variant of n-gram feature extraction method in ensemble classification of fake news. Eng Applic Artif Intell. 2023;120:105882.
110. Han X, Cui S, Liu S, Zhang C, Jiang B, Lu Z. Network intrusion detection based on n-gram frequency and time-aware transformer. Comput Secur. 2023;128:103171.
111. Hu M, Pan S, Li Y, Yang X. Advancing medical imaging with language models: A journey from n-grams to ChatGPT. arXiv preprint arXiv:2304.04920. 2023.
112. Das M, Alphonse P. A comparative study on TF-IDF feature weighting method and its analysis using unstructured dataset. arXiv preprint. 2023.
113. Makhmutova L, Ross R, Salton G. Impact of character n-grams attention scores for English and Russian news articles authorship attribution. In: Proceedings of the 38th ACM/SIGAPP symposium on applied computing; 2023. p. 939–41. https://doi.org/10.1145/3555776.3577856
114. Reimer J, Schmidt S, Fröbe M, Gienapp L, Scells H, Stein B. The archive query log: Mining millions of search result pages of hundreds of search engines from 25 years of web archives. In: Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval; 2023.
115. Bakhteev O, Chekhovich Y, Grabovoy A, Gorbachev G, Gorlenko T, Grashchenkov K. Cross-language plagiarism detection: A case study of European languages academic works. In: Academic integrity: Broadening practices, technologies, and the role of students: Proceedings from the European conference on academic integrity and plagiarism 2021; 2023.
116. Ahmed T. Exploring mathematical models and algorithms for plagiarism detection in text documents: A proof of concept. Research Square; 2024.
117. Chang C, Jhang S, Wu S, Roy D. JCF: Joint coarse- and fine-grained similarity comparison for plagiarism detection based on NLP. J Supercomput. 2024;80(1):363–94.
118. Suljic A, Hossain MS. Towards performance improvement of authorship attribution. IEEE Access. 2024.
119. Zamir MT, Ayub MA, Gul A, Ahmad N, Ahmad K. Stylometry analysis of multi-authored documents for authorship and author style change detection. arXiv preprint arXiv:2401.06752. 2024.
120. He X, Lashkari A, Vombatkere N, Sharma D. Authorship attribution methods, challenges, and future research directions: A comprehensive survey. Information. 2024;15(3):131.
121. Nahar K, Alshtaiwi M, Alikhashashneh E, Shatnawi N, Al-Shannaq M, Abual-Rub M. Plagiarism detection system by semantic and syntactic analysis based on latent Dirichlet allocation algorithm. Int J Adv Soft Comput Applic. 2024;16(1).
122. Parmar S, Jain B. VIBRANT-WALK: An algorithm to detect plagiarism of figures in academic papers. Expert Syst Applic. 2024;252:124251.
123. Mittal S, Mishra A, Khatter K. Psquad: Plagiarism detection and document similarity of Hindi text. Multimedia Tools Applic. 2024;83(6):17299–326.
124. Johnson S, Murty M, Navakanth I. A detailed review on word embedding techniques with emphasis on word2vec. Multimedia Tools Applic. 2024;83(13):37979–8007.
125. Yang C. Learning word embedding with better distance weighting and window size scheduling. arXiv preprint. 2024.
126. Zeng Y, Li Z, Chen Z, Ma H. Aspect-level sentiment analysis based on semantic heterogeneous graph convolutional network. Front Comput Sci. 2023;17(6):176340.
127. Ameer I, Bölücü N, Sidorov G, Can B. Emotion classification in texts over graph neural networks: Semantic representation is better than syntactic. IEEE Access. 2023;11:56921–34.
128. Sousa RT, Silva S, Pesquita C. Explaining protein-protein interactions with knowledge graph-based semantic similarity. Comput Biol Med. 2024;170:108076. pmid:38308873
129. Wu Y, Pan X, Li J, Dou S, Dong J, Wei D. Knowledge graph-based hierarchical text semantic representation. Int J Intell Syst. 2024;2024(1):5583270.
130. Zhang J, Liu Z, Hu X, Xia X, Li S. Vulnerability detection by learning from syntax-based execution paths of code. IEEE Trans Softw Eng. 2023;49(8):4196–212.
131. Han D, Li Q, Zhang L, Xu T. A smart contract vulnerability detection model based on syntactic and semantic fusion learning. Wireless Commun Mobile Comput. 2023;2023(1):9212269.
132. Bouaine C, Benabbou F, Sadgali I. Word embedding for high performance cross-language plagiarism detection techniques. Int J Interact Mobile Technol. 2023;17(10).
133. Mitkov R. The Oxford handbook of computational linguistics. Oxford University Press; 2022.
134. Jayanth K, Mohan G, Kumar R. Indian language analysis with XLM-RoBERTa: Enhancing parts of speech tagging for effective natural language preprocessing. In: 2023 seventh international conference on image information processing (ICIIP); 2023. p. 850–4.
135. Nambiar K, Peter SD, Idicula MS. Abstractive summarization of text document in Malayalam language: Enhancing attention model using POS tagging feature. ACM Trans Asian Low-Resour Lang Inform Process. 2023;22(2):1–14.
136. Tehseen A, Ehsan T, Liaqat H, Ali A, Al-Fuqaha A. Neural POS tagging of Shahmukhi by using contextualized word representations. J King Saud Univ-Comput Inform Sci. 2023;35(1):335–56.
137. Zukharova U. Check for plagiarism using text mining. Texas J Multidiscip Stud. 2023;19:73–8.
138. Wahle J, Ruas T, Foltýnek T, Meuschke N, Gipp B. Identifying machine-paraphrased plagiarism. In: Proceedings of the international conference on information; 2022.
139. Pupovac V. The frequency of plagiarism identified by text-matching software in scientific articles: A systematic review and meta-analysis. Scientometrics. 2021;126(11):8981–9003.
140. El-Rashidy M, Mohamed R, El-Fishawy N, Shouman M. Reliable plagiarism detection system based on deep learning approaches. Neural Comput Applic. 2022;34(21):18837–58.
141. Ting C, Johnson N, Onunkwo U, Tucker JD. Faster classification using compression analytics. In: 2021 international conference on data mining workshops (ICDMW); 2021. p. 813–22.
142. Veisi H, Golchinpour M, Salehi M, Gharavi E. Multi-level text document similarity estimation and its application for plagiarism detection. Iran J Comput Sci. 2022;5(2):143–55.
143. Bensalem I, Rosso P, Chikhi S. On the use of character n-grams as the only intrinsic evidence of plagiarism. Lang Resour Eval. 2019;53(3):363–96.
144. Ríos-Toledo G, Posadas-Durán JPF, Sidorov G, Castro-Sánchez NA. Detection of changes in literary writing style using N-grams as style markers and supervised machine learning. PLoS One. 2022;17(7):e0267590. pmid:35857768
145. Awale N, Pandey M, Dulal A, Timsina B. Plagiarism detection in programming assignments using machine learning. J Artif Intell Capsule Networks. 2020;2(3):177–84.
146. Costa VG, Pedreira CE. Recent advances in decision trees: An updated survey. Artif Intell Rev. 2023;56(5):4765–800.
147. Eppa A, Murali A. Source code plagiarism detection: A machine intelligence approach. In: 2022 IEEE fourth international conference on advances in electronics, computers and communications (ICAECC); 2022.
148. Lemantara J, Sunarto M, Hariadi B, Sagirani T, Amelia T. Prototype of online examination on MoLearn applications using text similarity to detect plagiarism. In: 2018 5th international conference on information technology, computer, and electrical engineering (ICITACEE); 2018.
149. 149. Khan TF, Anwar W, Arshad H, Abbas SN. An empirical study on authorship verification for low resource language using hyper-tuned CNN approach. IEEE Access. 2023.
* View Article
* Google Scholar
150. 150. Alhijawi B, Jarrar R, AbuAlRub A, Bader A. Deep learning detection method for large language models-generated scientific content. arXiv preprint. 2024;240300828.
* View Article
* Google Scholar
151. 151. Kavatage A, Menon S, Devi S, Patil S, Shilpa S. Multi-model essay evaluation with optical character recognition and plagiarism detection. In: Intelligent communication technologies and virtual mobile networks; 2023. .
152. 152. Bao W, Dong J, Xu Y, Yang Y, Qi X. Exploring attentive Siamese LSTM for low-resource text plagiarism detection. Data Intell; 2023. .
153. 153. Altamimi A, Umer M, Hanif D, Alsubai S, Kim T, Ashraf I. Employing Siamese MaLSTM model and ELMO word embedding for Quora duplicate questions detection. IEEE Access. 2024.
* View Article
* Google Scholar
154. 154. Saeed A, Taqa A. A proposed approach for plagiarism detection in article documents. Sinkron. 2022;6(2):568–78.
* View Article
* Google Scholar
155. 155. Chang C-Y, Lee S-J, Wu C-H, Liu C-F, Liu C-K. Using word semantic concepts for plagiarism detection in text documents. Inf Retrieval J. 2021;24(4–5):298–321. https://doi.org/10.1007/s10791-021-09394-4
156. 156. Jagtap D, Ambekar S, Singh H, Sharma N. An approach to detecting writing styles based on clustering technique. In: 2024 IEEE international students’ conference on electrical, electronics and computer science (SCEECS); 2024. p. 1–7.
157. 157. Amaliah Y, Musu W, Fadlan M. Auto clustering source code to detect plagiarism of student programming assignments in Java programming language. In: 2021 3rd international conference on cybernetics and intelligent system (iCORIS); 2021. .
158. 158. Saini A, Sri MR, Thakur M. Intrinsic plagiarism detection system using stylometric features and DBSCAN. In: Proceedings of the 2021 international conference on computing, communication, and intelligent systems (ICCCIS); 2021. .
159. 159. Eppa A, Murali AH. Machine learning techniques for multisource plagiarism detection. In: 2021 IEEE international conference on computation system and information technology for sustainable solutions (CSITSS); 2021. .
160. 160. Lee G, Kim J, Choi M, Jang R, Lee R. Review of code similarity and plagiarism detection research studies. Appl Sci. 2023:13(20);11358.
* View Article
* Google Scholar
161. 161. Sri S, Dutta S. A survey on automatic text summarization techniques. J Phys: Conf Ser.2021;1:012044.
* View Article
* Google Scholar
162. 162. Hafeez H, Muneer I, Sharjeel M, Ashraf M, Nawab R. Urdu short paraphrase detection at sentence level. ACM Trans Asian Low-Resour Lang Inform Process. 2023;22(4):1–20.
* View Article
* Google Scholar
163. 163. Wu H. Dilated convolution for enhanced extractive summarization: A GAN-based approach with BERT word embedding. J Intell Fuzzy Syst. 2024;46(Preprint):1–14.
* View Article
* Google Scholar
164. 164. He X, Shen X, Chen Z, Backes M, Zhang Y. Mgtbench: Benchmarking machine-generated text detection. arXiv preprint. 2023.
* View Article
* Google Scholar
165. 165. Solaiman I, Brundage M, Clark J, Askell A, Herbert-Voss A, Wu J. Release strategies and the social impacts of language models. arXiv preprint. 2019.
* View Article
* Google Scholar
166. 166. Fowler G. We tested a new ChatGPT-detector for teachers. It flagged an innocent student. Washington Post; 2023.
167. 167. Almuqren L, Cristea A. AraCust: a Saudi Telecom Tweets corpus for sentiment analysis. PeerJ Comput Sci. 2021;7:e510. pmid:34084924
* View Article
* PubMed/NCBI
* Google Scholar
Citation: Sajid M, Sanaullah M, Fuzail M, Malik TS, Shuhidan SM (2025) Comparative analysis of text-based plagiarism detection techniques. PLoS ONE 20(4): e0319551. https://doi.org/10.1371/journal.pone.0319551
About the Authors:
Muhammad Sajid
Roles: Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft
Affiliation: Department of Computer Science, Air University, Islamabad, Pakistan
ORCID: https://orcid.org/0000-0003-4896-2025
Muhammad Sanaullah
Roles: Formal analysis, Resources
Affiliation: Department of Computer Science, Air University, Islamabad, Pakistan
Muhammad Fuzail
Roles: Investigation, Methodology, Validation, Visualization
Affiliation: Computer Science Department, NFC Institute of Engineering and Technology, Multan, Punjab, Pakistan
Tauqeer Safdar Malik
Roles: Formal analysis, Investigation, Resources, Supervision, Validation, Visualization, Writing – review & editing
E-mail: [email protected]
Affiliation: Department of Information & Communication Technology, Bahauddin Zakariya University, Multan, Punjab, Pakistan
ORCID: https://orcid.org/0000-0002-2064-807X
Shuhaida Mohamed Shuhidan
Roles: Formal analysis, Funding acquisition, Writing – review & editing
Affiliation: Centre for Research in Data Science, Computer and Information Sciences Department, Universiti Teknologi Petronas, Perak, Malaysia
1. Chaka C. Reviewing the performance of AI detection tools in differentiating between AI-generated and human-written texts: A literature and integrative Hybrid review. J Appl Learn Teach. 2024;7(1).
2. Ibrahim K. Using AI-based detectors to control AI-assisted plagiarism in ESL writing: “The Terminator Versus the Machines”. Lang Test Asia. 2023;13(1):46.
3. Heumann M, Kraschewski T, Breitner MH. ChatGPT and GPTZero in research and social media: A sentiment- and topic-based analysis. In: Twenty-ninth Americas conference on information systems, Panama; 2023.
4. Xie Y, Wu S, Chakravarty S. AI meets AI: Artificial intelligence and academic integrity-A survey on mitigating AI-assisted cheating in computing education. In: Proceedings of the 24th annual conference on information technology education; 2023. p. 79–83.
5. Ansari M, Pandey D, Alenezi M. STORE: Security threat oriented requirements engineering methodology. J King Saud Univ-Comput Inform Sci. 2022;34(2):191–203.
6. Ansari M, Baz A, Alhakami H, Alhakami W, Kumar R, Khan R. P-STORE: Extension of STORE methodology to elicit privacy requirements. Arab J Sci Eng. 2021;46:8287–310.
7. Elkhatat A, Elsaid K, Almeer S. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. Int J Educ Integr. 2023:19(1);17.
8. Crawford J, Cowling M, Allen K. Leadership is needed for ethical ChatGPT: Character, assessment, and learning using artificial intelligence (AI). J Univ Teach Learn Pract. 2023:20(3);02.
9. King MR. ChatGPT. A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cell Mol Bioeng 2023;16(1):1–2. pmid:36660590
10. Dwivedi Y, Kshetri N, Hughes L, Slade E, Jeyaraj A, Kar A. So what if ChatGPT wrote it? Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inform Manage. 2023;71:102642.
11. Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection. In: Proceedings of the international conference on human-computer interaction. Springer; 2023.
12. Altheneyan A, Menai M. Evaluation of state-of-the-art paraphrase identification and its application to automatic plagiarism detection. Int J Pattern Recogn Artif Intell. 2020:34(04);2053004.
13. Alhakami W, Binmahfoudh A, Baz A, Alhakami H, Ansari M, Khan R. Atrocious impinging of COVID-19 pandemic on software development industries. Comput Syst Sci Eng. 2021;36(2):323–38.
14. Meuschke N. New AI classifier for indicating AI-written text. OpenAI. Springer; 2023.
15. Mindner L, Schlippe T, Schaaff K. Classification of human- and AI-generated texts: Investigating features for ChatGPT. In: Proceedings of the international conference on artificial intelligence in education technology; 2023. p. 152–70.
16. Creo A, Pudasaini S. Evading AI-generated content detectors using homoglyphs. arXiv preprint arXiv:240611239. 2024.
17. Kirchner. Analyzing non-textual content elements to detect academic plagiarism. OpenAI; 2023.
18. Emi B, Spero M. Technical report on the Checkfor.ai AI-generated text classifier. arXiv preprint arXiv:240214873. 2024.
19. El-Rashidy M, Mohamed R, El-Fishawy N, Shouman M. An effective text plagiarism detection system based on feature selection and SVM techniques. Multimedia Tools Applic. 2024;83(1):2609–46.
20. Baishya K. Plagiarism detection software: An overview. Res Publ Ethics. 2024:281.
21. Xiong J, Yang J, Yan L, Awais M, Khan AA, Alizadehsani R. Efficient reinforcement learning-based method for plagiarism detection boosted by a population-based algorithm for pretraining weights. Expert Syst Applic. 2024;238:122088.
22. Abbaszadeh Shahri A, Shan C, Larsson S, Johansson F. Normalizing large scale sensor-based MWD data: An automated method toward a unified database. Sensors (Basel) 2024;24(4):1209. pmid:38400367
23. Oloo V, Otieno C, Wanzare L. A literature survey on writing style change detection based on machine learning: State-of-the-art-review. Int J Comput Trends Technol. 2022;70(5):15–32.
24. Alzahrani S, Aljuaid H. Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases. J King Saud Univ-Comput Inform Sci. 2022;34(4):1110–23.
25. Ishaq M, Abid A, Farooq MS, Manzoor MF, Farooq U, Abid K, et al. Advances in database systems education: Methods, tools, curricula, and way forward. Educ Inf Technol (Dordr) 2023;28(3):2681–725. pmid:36061104
26. Farooq U, Rahim M, Sabir N, Hussain A, Abid A. Advances in machine translation for sign language: Approaches, limitations, and challenges. Neural Comput Applic. 2021;33(21):14357–99.
27. Ramzan M, Abid A, Khan H, Awan S, Ismail A, Ahmed M, et al. A review on state-of-the-art violence detection techniques. IEEE Access. 2019;7:107560–75.
28. Jiffriya M, Jahan M, Ragel R. Plagiarism detection tools and techniques: A comprehensive survey. J Sci-FAS-SEUSL. 2021;2(02):47–64.
29. Tehseen R, Farooq MS, Abid A. Earthquake prediction using expert systems: A systematic mapping study. Sustainability 2020;12(6):2420.
30. Zellers R, Holtzman A, Rashkin H, Bisk Y, Farhadi A, Roesner F, et al. Defending against neural fake news. Advances in neural information processing systems. 2019;32.
31. Gehrmann S, Strobelt H, Rush A. Gltr: Statistical detection and visualization of generated text. arXiv preprint. 2019.
32. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog. 2019:1(8);9.
33. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D. Roberta: A robustly optimized bert pretraining approach. arXiv preprint. 2019.
34. Dalai T, Mishra T, Sa P. Deep learning-based POS tagger and chunker for Odia language using pre-trained transformers. ACM Transactions on Asian and Low-Resource Language Information Processing. 2024;23(2):1–23.
35. Adelani D, Mai H, Fang F, Nguyen H, Yamagishi J, Echizen I. Generating sentiment-preserving fake online reviews using neural language models and their human- and machine-based detection. In: Advanced information networking and applications: Proceedings of the 34th international conference on advanced information networking and applications (AINA-2020); 2020.
36. Uchendu A, Le T, Shu K, Lee D. Authorship Attribution for Neural Text Generation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020:8384–95. https://doi.org/10.18653/v1/2020.emnlp-main.673
37. Keskar NS, McCann B, Varshney LR, Xiong C, Socher R. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint. 2019.
38. Lample G, Conneau A. Cross-lingual language model pretraining. arXiv preprint. 2019.
39. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. Xlnet: Generalized autoregressive pretraining for language understanding. Adv Neural Inform Process Syst. 2019;32.
40. Dathathri S, Madotto A, Lan J, Hung J, Frank E, Molino P. Plug and play language models: A simple approach to controlled text generation. arXiv preprint. 2019.
41. Fagni T, Falchi F, Gambini M, Martella A, Tesconi M. TweepFake: About detecting deepfake tweets. PLoS One 2021;16(5):e0251415. pmid:33984021
42. Harrag F, Debbah M, Darwish K, Abdelali A. Bert transformer model for detecting Arabic GPT2 auto-generated tweets. arXiv preprint. 2021.
43. Jawahar G, Abdul-Mageed M, Lakshmanan L. Automatic detection of entity manipulated text using factual knowledge. arXiv preprint. 2022.
44. Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint. 2023.
45. Fan A, Jernite Y, Perez E, Grangier D, Weston J, Auli M. ELI5: Long form question answering. arXiv. 2019.
46. Kirchenbauer J, Geiping J, Wen Y, Shu M, Saifullah K, Kong K. On the reliability of watermarks for large language models. arXiv preprint. 2023.
47. Fernandez P, Chaffin A, Tit K, Chappelier V, Furon T. Three bricks to consolidate watermarks for large language models. In: 2023 IEEE international workshop on information forensics and security (WIFS); 2023.
48. Christ M, Gunn S, Zamir O. Undetectable watermarks for language models. arXiv preprint. 2023;2306.09194.
49. Khaled F, Al-Tamimi MSH. Plagiarism detection methods and tools: An overview. Iraqi J Sci. 2021; p. 2771–2783.
50. Fuad A, Wicaksono A, Aqib M, Khoiruddin M, Fajar A, Mustamir K. AI hybrid based plagiarism detection system creation. In: Proceedings of the 4th international conference on advance computing and innovative technologies in engineering (ICACITE); 2024.
51. Pudasaini S, Miralles-Pechuán L, Lillis D, Llorens Salvador M. Survey on AI-generated plagiarism detection: The impact of large language models on academic integrity. J Acad Ethics. 2024:1–34.
52. Izi AN, Anggraini FN, Regita R, Rabiatuladawiyah R. A development of the Turnitin system in improving plagiarism detection for Islamic religious education studies. Suhuf. 2024;36(2).
53. Nketsiah I, Imoro O, Barfi KA. Postgraduate students’ perception of plagiarism, awareness, and use of Turnitin text-matching software. Acc Res 2024;31(7):786–802. pmid:36693789
54. Hourrane O. Rich style embedding for intrinsic plagiarism detection. Int J Adv Comput Sci Applic. 2019;10(11).
55. Mukhtar N, Khan M. Effective lexicon-based approach for Urdu sentiment analysis. Artif Intell Rev. 2020;53(4):2521–48.
56. Khonji M, Iraqi Y, Mekouar L. Authorship identification of electronic texts. IEEE Access. 2021;9:101124–46.
57. Quidwai MA, Li C, Dube P. Beyond black box ai-generated plagiarism detection: From sentence to document level. arXiv preprint. 2023;230608122.
58. Mitchell E, Lee Y, Khazatsky A, Manning C, Finn C. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In: Proceedings of the international conference on machine learning; 2023. .
59. Alshammari H, El-Sayed A, Elleithy K. Ai-generated text detector for Arabic language using encoder-based transformer architecture. Big Data Cogn Comput. 2024:8(3);32.
60. Widyassari A, Rustad S, Shidik G, Noersasongko E, Syukur A, Affandy A, et al. Review of automatic text summarization techniques & methods. J King Saud Univ-Comput Inform Sci. 2022;34(4):1029–46.
61. Iyer A, Vosoughi S. Style change detection using BERT. CLEF (Working Notes). 2020;93:106.
62. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
63. Pérez J, Díaz J, Garcia-Martin J, Tabuenca B. Systematic literature reviews in software engineering—Enhancement of the study selection process using Cohen’s kappa statistic. J Syst Softw. 2020;168:110657.
64. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. pmid:843571
65. Chaka C. Fourth industrial revolution—A review of applications, prospects, and challenges for artificial intelligence, robotics and blockchain in higher education. Res Pract Technol Enhanced Learn. 2023;18:002.
66. Chaka C. Is Education 4.0 a sufficient innovative, and disruptive educational trend to promote sustainable open education for higher education institutions? A review of literature trends. Front Educ. 2022;7:824976.
67. Bischoff S, Deckers N, Schliebs M, Thies B, Hagen M, Stamatatos E. The importance of suppressing domain style in authorship analysis. CoRR. 2020.
68. Yanaka H, Mineshima K, Bekki D, Inui K, Sekine S, Abzianidze L. Can neural networks understand monotonicity reasoning? arXiv preprint. 2019;1906.06448. https://doi.org/10.48550/arXiv.1906.06448
69. Alsallal M, Iqbal R, Amin S, James A. Intrinsic plagiarism detection using latent semantic indexing and stylometry. In: 2013 sixth international conference on developments in eSystems engineering; 2013.
70. Zhu Y, Kiros R, Zemel R, Salakhutdinov R, Urtasun R, Torralba A. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE international conference on computer vision; 2015.
71. AlSallal M, Iqbal R, Palade V, Amin S, Chang V. An integrated approach for intrinsic plagiarism detection. Fut Gen Comput Syst. 2019;96:700–12.
72. Tian J, Lan M. ECNU at SemEval-2016 Task 1: Leveraging word embedding from macro and micro views to boost performance for semantic textual similarity. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016); 2016.
73. Li X, Li J. Angle-optimized text embeddings. arXiv preprint. 2023.
74. Latif S, Qayyum A, Usman M, Qadir J. Cross lingual speech emotion recognition: Urdu vs. Western languages. In: 2018 international conference on frontiers of information technology (FIT); 2018. p. 88–93. https://doi.org/10.1109/fit.2018.00023
75. Datahub. InAra plagiarism detection corpus. Datahub; 2013.
76. Bensalem I, Rosso P, Chikhi S. Intrinsic plagiarism detection using n-gram classes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014.
77. Haseeb M, Manzoor MF, Farooq MS, Farooq U, Abid A. A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu. Data Brief. 2023;52:109857. pmid:38161660
78. Vasuteja A, Reddy AV, Pravin A. Beyond copy paste: Plagiarism detection using machine learning. In: 2024 international conference on inventive computation technologies (ICICT); 2024. p. 245–51. https://doi.org/10.1109/icict60155.2024.10544470
79. Patil R, Kadam V, Nakate R, Kadam S, Pattade S, Mitkari M. A novel natural language processing based model for plagiarism detection. In: 2024 international conference on emerging smart computing and informatics (ESCI); 2024.
80. Manzoor M, Farooq M, Haseeb M, Farooq U, Khalid S, Abid A. Exploring the landscape of intrinsic plagiarism detection: Benchmarks, techniques, evolution, and challenges. IEEE Access. 2023;11:140519–45.
81. Albahra S, Gorbett T, Robertson S, D’Aleo G, Kumar SVS, Ockunzzi S, et al. Artificial intelligence and machine learning overview in pathology & laboratory medicine: A general review of data preprocessing and basic supervised concepts. Semin Diagn Pathol 2023;40(2):71–87. pmid:36870825
82. Mallikharjuna Rao K, Saikrishna G, Supriya K. Data preprocessing techniques: Emergence and selection towards machine learning models-a practical review using HPA dataset. Multimedia Tools Applic. 2023;82(24):37177–96.
83. Siino M, Tinnirello I, La Cascia M. Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on transformers and traditional classifiers. Inform. Syst. 2024;121:102342.
84. Rahimi Z, Homayounpour M. The impact of preprocessing on word embedding quality: A comparative study. Lang Resour Eval. 2023;57(1):257–91.
85. Chai C. Comparison of text preprocessing methods. Nat Lang Eng. 2023;29(3):509–53.
86. Babanejad N, Davoudi H, Agrawal A, An A, Papagelis M. The role of preprocessing for word representation learning in affective tasks. IEEE Trans Affect Comput. 2023;15(1):254–72.
87. Toraman C, Yilmaz EH, Şahi̇nuç F, Ozcelik O. Impact of tokenization on language models: An analysis for Turkish. ACM Trans Asian Low-Resour Lang Inf Process 2023;22(4):1–21.
88. Korablev Y, Loseva D, Lonchina A. Methods for preprocessing and classification of text data in question-answer information systems. In: 2024 international conference on information processes and systems development and quality assurance (IPS); 2024. p. 27–32.
89. Nazir S, Asif M, Rehman M, Ahmad S. Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language. PeerJ Comput Sci. 2024;10:e1704. pmid:39669469
90. Archana S, Prakash J. An effective undersampling method for biomedical named entity recognition using machine learning. Evol Syst; 2024. p. 1–9.
91. Chavan T, Patil S. Named entity recognition (NER) for news articles. Development (IJAIRD). 2024;2(1):103–12.
92. Savci P, Das B. Structured named entity recognition (NER) in biomedical texts using pre-trained language models. In: 2024 12th international symposium on digital forensics and security (ISDFS); 2024.
93. Frank E, Oluwaseyi J, Olaoye G. Data preprocessing techniques for NLP in BI. Springer; 2024.
94. Nafea A, Muayad M, Majeed R, Ali A, Bashaddadh O, Khalaf M. A brief review on preprocessing text in Arabic language dataset: Techniques and challenges. Babylonian J Artif Intell. 2024;2024:46–53.
95. Bharti SK, Gupta RK, Patel S, Shah M. Context-based bigram model for POS tagging in Hindi: A heuristic approach. Ann Data Sci. 2024;11(1):347–378.
96. Wiącek M, Rybak P, Pszenny Ł, Wróblewska A. NLPre: A revised approach towards language-centric benchmarking of Natural Language Preprocessing systems. arXiv preprint arXiv:240304507. 2024.
97. Mounica B, Lavanya K. Feature selection method on twitter dataset with part-of-speech (PoS) pattern applied to traffic analysis. Int J Syst Assur Eng Manag. 2024;15(1):110–123.
98. Wei C, Pang R, Kuo CCJ. GWPT: A green word-embedding-based POS tagger. arXiv preprint. 2024;240107475.
99. Boukhlif M, Hanine M, Kharmoum N, Noriega A, Obeso D, Ashraf I. Natural language processing-based software testing: A systematic literature review. IEEE Access. 2024.
100. Bozyigit F, Bardakci T, Khalilipour A, Challenger M, Ramackers G, Babur O. Generating domain models from natural language text using NLP: A benchmark dataset and experimental comparison of tools. Softw Syst Model. 2024:1–19.
101. Kutsenok L, Korablev Y. Research of applicability of natural language processing models to the task of analyzing technical tasks and specifications for software development. In: 2024 XXVII international conference on soft computing and measurements (SCM); 2024. p. 200–3.
102. Bourahouat G, Abourezq M, Daoudi N. Word embedding as a semantic feature extraction technique in Arabic natural language processing: An overview. Int Arab J Inf Technol. 2024;21(2):313–25.
103. Gorai J, Shaw DK. Semantic difference-based feature extraction technique for fake news detection. J Supercomput. 2024; p. 1–23.
104. Tavabi N, Singh M, Pruneski J, Kiapour AM. Systematic evaluation of common natural language processing techniques to codify clinical notes. PLoS One 2024;19(3):e0298892. pmid:38451905
105. Gupta A, Chadha A, Tewari V. A natural language processing model on BERT and YAKE technique for keyword extraction on sustainability reports. IEEE Access. 2024.
106. Xie R, Ahia O, Tsvetkov Y, Anastasopoulos A. Extracting lexical features from dialects via interpretable dialect classifiers. arXiv preprint. 2024;240217914.
107. Alfreihat M, Almousa O, Tashtoush Y, AlSobeh A, Mansour K, Migdady H. Emo-SL framework: Emoji sentiment lexicon using text-based features and machine learning for sentiment analysis. IEEE Access. 2024.
108. Ahanin Z, Ismail MA, Singh NSS, AL-Ashmori A. Hybrid feature extraction for multi-label emotion classification in English text messages. Sustainability 2023;15(16):12539.
109. Ksieniewicz P, Zyblewski P, Borek-Marciniec W, Kozik R, Choraś M, Woźniak M. Alphabet flatting as a variant of n-gram feature extraction method in ensemble classification of fake news. Eng Applic Artif Intell. 2023;120:105882.
110. Han X, Cui S, Liu S, Zhang C, Jiang B, Lu Z. Network intrusion detection based on n-gram frequency and time-aware transformer. Comput Secur. 2023;128:103171.
111. Hu M, Pan S, Li Y, Yang X. Advancing medical imaging with language models: A journey from n-grams to ChatGPT. arXiv preprint arXiv:230404920. 2023.
112. Das M, Alphonse P. A comparative study on TF-IDF feature weighting method and its analysis using unstructured dataset. arXiv preprint. 2023.
113. Makhmutova L, Ross R, Salton G. Impact of character n-grams attention scores for English and Russian News articles authorship attribution. In: Proceedings of the 38th ACM/SIGAPP symposium on applied computing. 2023:939–41. https://doi.org/10.1145/3555776.3577856
114. Reimer J, Schmidt S, Fröbe M, Gienapp L, Scells H, Stein B. The archive query log: mining millions of search result pages of hundreds of search engines from 25 years of web archives. In: Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval; 2023. .
115. Bakhteev O, Chekhovich Y, Grabovoy A, Gorbachev G, Gorlenko T, Grashchenkov K. Cross-language plagiarism detection: A case study of European languages academic works. In: Academic Integrity: Broadening practices, technologies, and the role of students: Proceedings from the European conference on academic integrity and plagiarism 2021; 2023. .
116. Ahmed T. Exploring mathematical models and algorithms for plagiarism detection in text documents: A proof of concept. Research Square; 2024.
117. Chang C, Jhang S, Wu S, Roy D. JCF: Joint coarse-and fine-grained similarity comparison for plagiarism detection based on NLP. J Supercomput. 2024;80(1):363–94.
118. Suljic A, Hossain MS. Towards performance improvement of authorship attribution. IEEE Access. 2024.
119. Zamir MT, Ayub MA, Gul A, Ahmad N, Ahmad K. Stylometry analysis of multi-authored documents for authorship and author style change detection. arXiv preprint arXiv:240106752. 2024.
120. He X, Lashkari A, Vombatkere N, Sharma D. Authorship attribution methods, challenges, and future research directions: A comprehensive survey. Information. 2024:15(3);131.
121. Nahar K, Alshtaiwi M, Alikhashashneh E, Shatnawi N, Al-Shannaq M, Abual-Rub M. Plagiarism detection system by semantic and syntactic analysis based on latent Dirichlet allocation algorithm. Int J Adv Soft Comput Applic. 2024;16(1).
122. Parmar S, Jain B. VIBRANT-WALK: An algorithm to detect plagiarism of figures in academic papers. Expert Syst Applic. 2024;252:124251.
123. Mittal S, Mishra A, Khatter K. Psquad: Plagiarism detection and document similarity of Hindi text. Multimedia Tools Applic. 2024;83(6):17299–326.
124. Johnson S, Murty M, Navakanth I. A detailed review on word embedding techniques with emphasis on word2vec. Multimedia Tools Applic. 2024;83(13):37979–8007.
125. Yang C. Learning word embedding with better distance weighting and window size scheduling. arXiv preprint. 2024.
126. Zeng Y, Li Z, Chen Z, Ma H. Aspect-level sentiment analysis based on semantic heterogeneous graph convolutional network. Front Comput Sci. 2023:17(6);176340.
127. Ameer I, Bölücü N, Sidorov G, Can B. Emotion classification in texts over graph neural networks: Semantic representation is better than syntactic. IEEE Access. 2023;11:56921–34.
128. Sousa RT, Silva S, Pesquita C. Explaining protein-protein interactions with knowledge graph-based semantic similarity. Comput Biol Med. 2024;170:108076. pmid:38308873
129. Wu Y, Pan X, Li J, Dou S, Dong J, Wei D. Knowledge graph-based hierarchical text semantic representation. Int J Intell Syst. 2024:2024(1);5583270.
130. Zhang J, Liu Z, Hu X, Xia X, Li S. Vulnerability detection by learning from syntax-based execution paths of code. IEEE Trans Softw Eng. 2023;49(8):4196–212.
131. Han D, Li Q, Zhang L, Xu T. A smart contract vulnerability detection model based on syntactic and semantic fusion learning. Wireless Commun Mobile Comput. 2023:2023(1);9212269.
132. Bouaine C, Benabbou F, Sadgali I. Word embedding for high performance cross-language plagiarism detection techniques. Int J Interact Mobile Technol. 2023;17(10).
133. Mitkov R. The Oxford handbook of computational linguistics. Oxford University Press; 2022.
134. Jayanth K, Mohan G, Kumar R. Indian language analysis with XLM-RoBERTa: Enhancing parts of speech tagging for effective natural language preprocessing. In: 2023 seventh international conference on image information processing (ICIIP); 2023. p. 850–4.
135. Nambiar K, Peter SD, Idicula MS. Abstractive summarization of text document in Malayalam language: Enhancing attention model using POS tagging feature. ACM Trans Asian Low-Resour Lang Inform Process. 2023;22(2):1–14.
136. Tehseen A, Ehsan T, Liaqat H, Ali A, Al-Fuqaha A. Neural POS tagging of Shahmukhi by using contextualized word representations. J King Saud Univ-Comput Inform Sci. 2023;35(1):335–56.
137. Zukharova U. Check for plagiarism using text mining. Texas J Multidiscip Stud. 2023;19:73–8.
138. Wahle J, Ruas T, Foltýnek T, Meuschke N, Gipp B. Identifying machine-paraphrased plagiarism. In: Proceedings of the international conference on information; 2022.
139. Pupovac V. The frequency of plagiarism identified by text-matching software in scientific articles: a systematic review and meta-analysis. Scientometrics. 2021;126(11):8981–9003.
140. El-Rashidy M, Mohamed R, El-Fishawy N, Shouman M. Reliable plagiarism detection system based on deep learning approaches. Neural Comput Applic. 2022;34(21):18837–58.
141. Ting C, Johnson N, Onunkwo U, Tucker JD. Faster classification using compression analytics. In: 2021 international conference on data mining workshops (ICDMW); 2021. p. 813–22.
142. Veisi H, Golchinpour M, Salehi M, Gharavi E. Multi-level text document similarity estimation and its application for plagiarism detection. Iran J Comput Sci. 2022;5(2):143–55.
143. Bensalem I, Rosso P, Chikhi S. On the use of character n-grams as the only intrinsic evidence of plagiarism. Lang Resour Eval. 2019;53(3):363–96.
144. Ríos-Toledo G, Posadas-Durán JPF, Sidorov G, Castro-Sánchez NA. Detection of changes in literary writing style using N-grams as style markers and supervised machine learning. PLoS One 2022;17(7):e0267590. pmid:35857768
145. Awale N, Pandey M, Dulal A, Timsina B. Plagiarism detection in programming assignments using machine learning. J Artif Intell Capsule Networks. 2020;2(3):177–84.
146. Costa VG, Pedreira CE. Recent advances in decision trees: An updated survey. Artif Intell Rev. 2023;56(5):4765–800.
147. Eppa A, Murali A. Source code plagiarism detection: A machine intelligence approach. In: 2022 IEEE fourth international conference on advances in electronics, computers and communications (ICAECC); 2022.
148. Lemantara J, Sunarto M, Hariadi B, Sagirani T, Amelia T. Prototype of online examination on MoLearn applications using text similarity to detect plagiarism. In: 2018 5th international conference on information technology, computer, and electrical engineering (ICITACEE); 2018.
149. Khan TF, Anwar W, Arshad H, Abbas SN. An empirical study on authorship verification for low resource language using hyper-tuned CNN approach. IEEE Access. 2023.
150. Alhijawi B, Jarrar R, AbuAlRub A, Bader A. Deep learning detection method for large language models-generated scientific content. arXiv preprint arXiv:2403.00828; 2024.
151. Kavatage A, Menon S, Devi S, Patil S, Shilpa S. Multi-model essay evaluation with optical character recognition and plagiarism detection. In: Intelligent communication technologies and virtual mobile networks; 2023.
152. Bao W, Dong J, Xu Y, Yang Y, Qi X. Exploring attentive Siamese LSTM for low-resource text plagiarism detection. Data Intell. 2023.
153. Altamimi A, Umer M, Hanif D, Alsubai S, Kim T, Ashraf I. Employing Siamese MaLSTM model and ELMO word embedding for Quora duplicate questions detection. IEEE Access. 2024.
154. Saeed A, Taqa A. A proposed approach for plagiarism detection in article documents. Sinkron. 2022;6(2):568–78.
155. Chang C-Y, Lee S-J, Wu C-H, Liu C-F, Liu C-K. Using word semantic concepts for plagiarism detection in text documents. Inf Retrieval J. 2021;24(4–5):298–321. https://doi.org/10.1007/s10791-021-09394-4
156. Jagtap D, Ambekar S, Singh H, Sharma N. An approach to detecting writing styles based on clustering technique. In: 2024 IEEE international students’ conference on electrical, electronics and computer science (SCEECS); 2024. p. 1–7.
157. Amaliah Y, Musu W, Fadlan M. Auto clustering source code to detect plagiarism of student programming assignments in Java programming language. In: 2021 3rd international conference on cybernetics and intelligent system (iCORIS); 2021.
158. Saini A, Sri MR, Thakur M. Intrinsic plagiarism detection system using stylometric features and DBSCAN. In: Proceedings of the 2021 international conference on computing, communication, and intelligent systems (ICCCIS); 2021.
159. Eppa A, Murali AH. Machine learning techniques for multisource plagiarism detection. In: 2021 IEEE international conference on computation system and information technology for sustainable solutions (CSITSS); 2021.
160. Lee G, Kim J, Choi M, Jang R, Lee R. Review of code similarity and plagiarism detection research studies. Appl Sci. 2023;13(20):11358.
161. Sri S, Dutta S. A survey on automatic text summarization techniques. J Phys: Conf Ser. 2021;1:012044.
162. Hafeez H, Muneer I, Sharjeel M, Ashraf M, Nawab R. Urdu short paraphrase detection at sentence level. ACM Trans Asian Low-Resour Lang Inform Process. 2023;22(4):1–20.
163. Wu H. Dilated convolution for enhanced extractive summarization: A GAN-based approach with BERT word embedding. J Intell Fuzzy Syst. 2024;46(Preprint):1–14.
164. He X, Shen X, Chen Z, Backes M, Zhang Y. Mgtbench: Benchmarking machine-generated text detection. arXiv preprint. 2023.
165. Solaiman I, Brundage M, Clark J, Askell A, Herbert-Voss A, Wu J. Release strategies and the social impacts of language models. arXiv preprint. 2019.
166. Fowler G. We tested a new ChatGPT-detector for teachers. It flagged an innocent student. Washington Post; 2023.
167. Almuqren L, Cristea A. AraCust: a Saudi Telecom Tweets corpus for sentiment analysis. PeerJ Comput Sci. 2021;7:e510. pmid:34084924
© 2025 Sajid et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
In text analysis, plagiarism detection is a crucial area of study that searches a document for copied content and determines whether all portions of the text were written by the same author. With the emergence of publicly available content-generation tools based on large language models, the problem of inherent plagiarism has grown in importance across various industries. Students commit plagiarism more frequently as a result of the availability of computers in the classroom and the broad accessibility of electronic information on the internet. Consequently, there is a rising need for reliable and precise detection techniques suited to this changing environment. This paper compares several plagiarism detection techniques and investigates how well different detection systems distinguish between content created by humans and content created by Artificial Intelligence (AI). We systematically evaluate 189 research papers published between 2019 and 2024 to provide an overview of the research on computational approaches to plagiarism detection (PD). To organize the presentation of the research contributions, we propose a new, technically focused structure covering efforts to prevent and identify plagiarism, types of plagiarism, and computational techniques for detecting plagiarism. We show that plagiarism detection remains a highly active field of research. Over the period reviewed, significant progress has been made in automatically identifying plagiarism that is highly obscured and hence difficult to recognize. The key sources of these advances are the exploration of nontextual content, the use of machine learning, and improved semantic text analysis techniques.
Based on our analysis, we conclude that combining several analytical methodologies for textual and nontextual content features is the most promising direction for future research to further improve plagiarism detection.
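As a minimal illustration of the simplest class of computational approach surveyed here, the sketch below compares two texts by the Jaccard overlap of their character n-gram sets, a common lexical baseline for detecting near-verbatim copying (cf. the character n-gram evidence discussed in [143]). The function names and sample sentences are illustrative, not drawn from any of the reviewed systems; real detectors layer semantic and machine learning analysis on top of such baselines.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets; 1.0 means identical sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

original = "Academic plagiarism is using ideas without citing the source."
copied = "Academic plagiarism is using ideas without citing the source."
reworded = "Borrowing concepts from others with no attribution is plagiarism."

print(jaccard_similarity(original, copied))    # identical texts -> 1.0
print(jaccard_similarity(original, reworded))  # paraphrase -> much lower score
```

A verbatim copy scores 1.0, while a paraphrase that preserves meaning but changes wording scores far lower — exactly the weakness of lexical baselines that motivates the semantic techniques reviewed above.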