Abstract
Authorship identification is one of the challenging problems in natural language processing. This research aims to improve the accuracy and stability of authorship identification with a new deep learning framework that combines several types of features in a self-attention-based weighted ensemble. The approach improves generalization by combining a wide range of writing-style representations: statistical features, TF-IDF vectors, and Word2Vec embeddings. Each feature set is fed to a separate Convolutional Neural Network (CNN) so that the stylistic patterns specific to that representation can be extracted. A self-attention mechanism then combines the outputs of these specialized CNNs, allowing the model to learn the importance of each feature type dynamically. The combined representation is passed to a weighted SoftMax classifier, which exploits the strengths of the individual network branches. The proposed model was tested on two datasets: Dataset A, with four authors, and Dataset B, with thirty authors. It achieved accuracies of 80.29% on Dataset A and 78.44% on Dataset B, outperforming the baseline state-of-the-art methods by at least 3.09% and 4.45%, respectively. These results indicate that the self-attention-augmented multi-feature ensemble is effective and improves both the accuracy and the robustness of author identification.
Introduction
Author identification is one of the oldest and, at the same time, one of the most current problems in stylistics and information retrieval. It is defined as an attempt to determine the characteristics of the producer or writer of a piece of linguistic information, so that texts written by different people can be distinguished1. The first attempts to measure writing style date back to the nineteenth century, with Mendenhall’s 1887 study of Shakespeare’s plays2. The rapid growth of internet communication, the anonymity offered by tools such as email, blogs, and websites, and the rise of legal disputes have all drawn more attention to the problem. In practice, the characteristics of the author of a text are extracted and compared with those of texts written by several people, so that a meaningful distinction can be made between the texts. Author identification combines linguistics and data mining: linguistics is used to identify and extract the features of the text, and data mining is used to perform the computational and statistical processing that identifies the author3. Many researchers assume that each person has a specific pattern of language use in their writing, which acts as a kind of fingerprint of the author and is called a writeprint. With the expansion of the Internet and its borderless nature, as well as the increase in online communication, the need for author identification in security applications is growing day by day. Applications include plagiarism detection (for example, in academic articles and books), identifying the author of inappropriate texts sent anonymously or under pseudonyms (such as e-mails or threatening letters), and resolving historical questions about disputed texts of unknown authorship3, 4–5.
Identifying the author of a text is an important problem in fields such as information retrieval and linguistics. It is also important in practical fields such as journalism and legal proceedings, where finding the author of a text may even save a human life. Since a text is a piece of linguistic information, the goal of author identification is to determine who wrote a given text6,7. The general framework of the proposed solutions to this problem is text classification3. In general, distinctive features are first extracted from texts with a known author and stored as numerical vectors. The same features are then extracted from the text whose author is unknown and stored in another numerical vector. Finally, these vectors are compared with each other to select the most likely author8,9. To identify the author, a set of text features must therefore be selected as the criteria for review and comparison. Various criteria exist for converting text into numerical vectors and comparing texts, such as sentence length, the number of words, the repetition rate of words and characters, and vocabulary richness functions10. Identification relies on the author’s prose and writing style or, in other words, on the hidden features of the texts they write.
These hidden features are combinations of writing characteristics such as word length, paragraph arrangement, vocabulary richness, and the use of function words. Through these characteristics of cognitive style, which usually remain constant across a person’s writing, the author of a text can be identified. When designing methods for recognizing the author of a text, it is important to choose key features and to remove redundant and irrelevant ones11, 12–13.
In the following, a number of studies related to the identification of the author are discussed based on the patterns and writing style used in the text. Qian et al. (2017) used different deep learning (DL) models and different datasets to identify authorship from textual data. The results showed that the Gated Recurrent Unit (GRU) had the best performance and the highest accuracy in identifying authors in two datasets14. Vijayakumar and Fuad (2019) used a new approach based on a combination of machine learning (ML) techniques and natural language processing (NLP) to identify short text authors. The results of this study showed that ML techniques have a high ability to solve problems on a small scale5. Benzebouchi et al. (2019) conducted a case study on an English text dataset and found that the MLP classifier in combination with word2vec has an accuracy of 95.83%. The results showed that using the word2vec word embedding model has a significant effect on improving the recognition accuracy compared to other classical models such as n-gram frequencies and bag of words15. Ma et al. (2020) showed in a study that DL methods in combination with methods for extracting stylistic features have appropriate and acceptable accuracy16.
Stoean and Lichtblau (2020) proposed a model based on the Chaos Game Representation method, which represents documents as images that are subsequently classified by a DL algorithm to identify the author. The results of the case study confirmed the accuracy of the proposed approach17. Saedi and Dras (2021) showed that Siamese networks are the most accurate in identifying the author in a large-scale dataset review. The results of this study showed that the mentioned methods are more accurate than the traditional methods based on DL18. Alhuqail (2021) used Bag of Words (BOW) and Latent Semantic Analysis (LSA) to extract features, and support vector machine (SVM), random forest, Bidirectional Encoder Representations from Transformers (BERT), and logistic regression as classifiers. The results showed that BOW is better than LSA for all algorithms, that the BERT model was the most accurate for the first range of the dataset, and that the logistic regression model was the most accurate for the second range of the dataset6.
YÜLÜCE and DALKILIÇ (2022) investigated the problem of author identification from textual data using ML algorithms. By examining three different datasets, they showed that using the Term frequency-inverse document frequency (TF-IDF) method to identify features and using Stochastic Gradient Descent (SGD) and Logistic Regression (LR) methods, the highest accuracy is achieved19. Abbasi et al. (2022) used a new approach based on group learning, DistilBERT, and conventional ML techniques to investigate the problem of author identification from text. The results showed that the proposed group learning approach, in integration with the TF-IDF method to extract features, has acceptable and appropriate accuracy20. Tang (2024) used a DL method to build an automatic text feature extraction model and identify the author of Chinese texts. In the proposed model, CNN and Attention algorithms were used as text feature extractors, as well as LSTM and Softmax, Window Feature Sequence methods for data classification21.
Demšar (2006) addressed the gap in statistical methods for comparing multiple learning algorithms across various datasets, recommending non-parametric tests like Wilcoxon signed-rank and Friedman with critical difference diagrams for robust analysis22. While previous research criticized null hypothesis testing (NHST) in machine learning, Benavoli et al. (2017) proposed alternative non-parametric tests for comparing multiple algorithms across datasets, acknowledging the limitations of NHST23.
Large Language Models (LLMs) have made significant progress in recent years. Huang et al. (2024)24 studied the application of these models to the problem of author recognition. Their results show that LLMs perform well on this problem. However, the considerable complexity of these models and the large computing resources they require have limited their use. The study by Gargiulo et al. (2022)25 suggests the potential benefits of employing Neural Language Models (NLMs) such as ELECTRA for various NLP tasks. Author identification could be one of the applications in which NLMs prove useful.
Korbak et al. (2023) present a way of training LLMs with human feedback to produce text that is more aligned with human values. This leads to the idea of changing the LLM training from imitation learning to the direct inclusion of human values26. Trotta et al. (2021) present the ItaCoLA corpus, which is an Italian sentence set with acceptability ratings based on the English CoLA corpus. Their work suggests the need for more multilingual resources for evaluating LLMs27. Zhang et al. (2023) propose MELA (Multilingual Evaluation of Linguistic Acceptability), the first benchmark for LLMs that includes multiple languages to assess the models’ ability to handle linguistic acceptability. Their studies focus on the cross-lingual transfer and multi-task learning; the results show that in-language training data is crucial for accurate acceptability judgments28.
Wen et al. (2024) introduced a benchmark, named AIDBench, to evaluate the performance of LLMs in identifying authors29. This research considered one-to-one and one-to-many scenarios of author identification using LLMs. The first scenario determines whether two texts were written by the same author; the second relates a text to a collection of texts belonging to an author. This evaluation demonstrates the capability of LLMs in identifying authors. Setzu et al.30 investigated the application of eXplainable AI (XAI) in authorship identification. This research is an attempt to describe the reasons for attributing each text to an author. They examined the application of three XAI methods (factual and counterfactual selection, probing, feature ranking) in three scenarios (verification of same authorships, authorship verification, and author identification).
Mächtle et al. (2025) investigated the problem of code authorship and introduced a model named OCEAN, which performs function-level author attribution based on contrastive learning31. It is the first attempt to introduce a framework for exploring code authorship attribution on compiled binaries. In this scenario, two code instances from unknown authors are compared to determine whether they were developed by the same author. Gaviria de la Puerta et al. (2024) presented an approach for identifying authors based on content in social networks32. Win (2024) proposed a deep learning-based model for identifying authors33. This model uses Word2Vec to describe textual features and then employs a deep neural network to classify them and identify authors.
Boukhaled (2022) studied the application of machine learning techniques to authorship identification for classical Arabic34. The author examined several feature extraction approaches, including TF-IDF and character-level and word-level n-grams, as well as the efficiency of three machine learning models (logistic regression, naïve Bayes, and k-nearest neighbors) in identifying authors. The results demonstrated the superiority of k-nearest neighbors; however, more efficient feature extraction approaches could probably improve the efficiency of the model. Abbas et al. (2023) studied the application of active learning and machine learning to identifying the author of unknown news text35. This model employs TF-IDF to describe textual features and then uses several active learning-based machine learning models to identify the author. The studied models are Multilayer Perceptron, XGBoost, Random Forest, and Logistic Regression. The findings show that XGBoost combined with active learning achieves the highest accuracy in identifying authors. Table 1 summarizes the studied works.
Table 1. Summary of the literature.
| Reference | Year | Method | Limitation |
|---|---|---|---|
| Vijayakumar & Fuad5 | 2019 | ML & NLP combination | Not suitable for long texts |
| Alhuqail6 | 2021 | BOW, LSA, & various classifiers | Limited exploration of advanced DL models |
| Qian et al.14 | 2017 | DL models | Limited dataset variety |
| Benzebouchi et al.15 | 2019 | MLP + word2vec | Lacks exploration of advanced DL techniques |
| Ma et al.16 | 2020 | DL + stylistic feature extraction | Doesn’t specify chosen DL models |
| Stoean & Lichtblau17 | 2020 | Chaos Game Representation + DL | Limited to case studies, needs broader validation |
| Saedi & Dras18 | 2021 | Siamese Networks | Comparison with other DL models not explored |
| YÜLÜCE & DALKILIÇ19 | 2022 | TF-IDF + SGD/LR | Lacks exploration of more recent feature extraction methods |
| Abbasi et al.20 | 2022 | Group learning + TF-IDF + ML | Limited exploration of advanced DL models with group learning |
| Tang21 | 2024 | CNN, Attention, LSTM, Softmax | Lacks comparison with other language models |
| Huang et al.24 | 2024 | LLMs | High complexity and computational resources |
| Wen et al.29 | 2024 | LLMs evaluation for one-to-one and one-to-many scenarios | Focuses on evaluating LLM capabilities rather than proposing a novel authorship identification model |
| Setzu et al.30 | 2025 | XAI methods and machine learning | Focuses on explainability of authorship identification, primarily in cultural heritage contexts |
| Mächtle et al.31 | 2025 | Function-level author attribution based on contrastive learning | Primarily addresses code authorship in compiled binaries, which may not directly translate to general text-based author identification |
| Gaviria de la Puerta et al.32 | 2024 | Content-based authorship identification for social networks | The content-based aspect implies less emphasis on stylistic features |
| Win33 | 2024 | Word2Vec + DL | Limited feature diversity |
| Boukhaled34 | 2022 | ML models + TF-IDF and n-grams for classical Arabic | Language-specific; feature and model limitations |
| Abbas et al.35 | 2023 | Active Learning and ML | Feature and domain limitations, potentially overlooking other valuable stylistic features |
The review of author identification studies shows that, although many studies have been conducted in this field, they have limitations. Most of them identify the author with traditional methods based on feature extraction and selection. In addition, most of them do not consider different environmental conditions when preparing their datasets. On the other hand, various studies have emphasized the capabilities of DL-based methods and the need for a precise method of identifying patterns and writing style. The high variety of patterns in author identification based on writing style makes it impossible to settle on a specific set of features or a single learning model. This research aims to address the identified gaps by employing a weighted ensemble of CNNs, which combines the strengths of different CNN architectures for robust feature extraction. The proposed model also combines linguistically specific metrics with TF-IDF and word2vec features to provide a better representation of the textual content of literary works. Three CNNs are combined into a weighted hybrid model that forms a classifier for identifying the author more accurately based on writing style. The novel contributions of this research are:
Combining the individual capabilities of CNN models in the form of an ensemble learning model to achieve higher accuracy in identifying authors.
Making decisions in an ensemble learning system based on the weighted combination of CNN models’ outputs using self-attention mechanism and SoftMax classifier.
Improving generalizability in author identification systems with multiple combinations of feature description methods (statistical, TF-IDF, and word2vec features).
The article is organized as follows: The introduction is given in the first section. In Sect. “Research Methodology”, the research methodology, including data introduction and details of the proposed method, is presented. In Sect. “Results and Discussion”, the research results are presented, and Sect. “Conclusion” concludes the article.
Research methodology
Data
In this research, two different datasets are used to evaluate the performance of the proposed author identification model. The first dataset contains the literary works of four authors. Each author in this dataset has at least 100 text samples, each with a minimum length of 8000 characters and containing punctuation marks in addition to numbers and letters. The samples are excerpts from the author’s published literary works, including novels, essays, and prose. This dataset, hereafter referred to as Set A, contains a total of 420 text samples. The second dataset covers the works of 30 authors and contains 900 text samples, that is, 30 samples per author. Each sample in this dataset is an excerpt from a novel, article, prose work, poem, or report by the author and contains at least 1500 characters. This dataset is hereafter called Set B. The larger number of authors, the smaller number of samples per author, and the lower average sample length make author identification on Set B more challenging than on Set A; the performance of the proposed method can therefore be examined under conditions closer to real-world scenarios.
The proposed method
The high diversity of patterns in author identification based on writing style makes it impossible to rely on a specific set of features or a single learning model to solve this problem. For this reason, this research seeks an author identification system with higher accuracy and generalizability by using different feature description techniques together with a combination of several learning models. The proposed method includes the following basic steps:
Pre-processing of texts
Representing the features of texts
Classification
This mechanism is shown as a diagram in Fig. 1.
Fig. 1 [Images not available. See PDF.]
Diagram of the steps of the proposed method.
According to the process shown in Fig. 1, the input texts are first pre-processed, and each text document is divided into its constituent sentences. After pre-processing, three different strategies are used to represent the features of the text: “statistical features”, the “TF-IDF technique”, and the “word2vec technique”. To describe the statistical features, the grammatical role of each word in every sentence of the input texts is identified, and the set of statistical features of the text is extracted from the frequencies of the identified tags. To extract TF-IDF vectors, the keywords of the text are first extracted, and the weights of these keywords are then computed as a vector. Finally, word2vec feature vectors are extracted from the content of each input text. Each of the resulting feature vectors is used as the input of a dedicated convolutional neural network model, whose task is to identify the author from the features it receives. Three separate CNN models form the proposed ensemble architecture, one for each of the three feature representations used in this study (statistical features, TF-IDF vectors, and Word2Vec embeddings). Because each CNN receives a single feature type, it specializes in learning authorship patterns from that representation independently (Fig. 1 shows this concept). Using separate models lets the system exploit the advantages of each feature type before the final integration step. Although three models are used here, the framework can accommodate a different number of models if new feature types are added. Each CNN model ends with a fully connected layer and a classification layer; the outputs of these two layers are used by the ensemble learning model.
For this purpose, the activation values obtained from the last fully connected layer of each CNN model (shown as FCi in Fig. 1) are merged with the training accuracy of that model for each target author (shown as ECi in Fig. 1), so that the output of each model is obtained as a single vector. Finally, the output vectors of the CNN models are fused using the self-attention mechanism, and a SoftMax classifier aggregates the results and determines the author of the input text.
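The fusion step can be illustrated with a minimal sketch. It assumes plain scaled dot-product self-attention without learned projection matrices and a simple average of the attended vectors; the text does not specify these details, so the function names (`branch_output`, `self_attention_fuse`) and the toy numbers are hypothetical and not the authors' implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def branch_output(fc_activations, class_accuracy):
    """Merge FC-layer activations (FCi) with per-author training accuracy (ECi)."""
    return list(fc_activations) + list(class_accuracy)

def self_attention_fuse(vectors):
    """Fuse branch vectors with scaled dot-product self-attention.

    Assumption: queries, keys, and values are the raw branch vectors
    (no learned projections), and the attended vectors are averaged.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    attention, attended = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        attention.append(weights)
        attended.append([sum(w * v[i] for w, v in zip(weights, vectors))
                         for i in range(dim)])
    fused = [sum(vec[i] for vec in attended) / len(attended) for i in range(dim)]
    return fused, attention

# Toy example: two FC activations and two per-author accuracies per branch.
stat = branch_output([0.2, 0.7], [0.81, 0.76])
tfidf = branch_output([0.6, 0.1], [0.79, 0.83])
w2v = branch_output([0.4, 0.5], [0.85, 0.80])
fused, attention = self_attention_fuse([stat, tfidf, w2v])
```

In a trained system the fused vector would be passed to a SoftMax classification layer whose weights are learned; that layer is omitted here because its parameters are not given in the text.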
Text preprocessing
Data preparation is the first step of the proposed method. It transforms the raw text into a format suitable for further analysis by breaking the text down into its fundamental constituents: paragraphs, sentences, and individual words. Because sentences are the smallest units that convey meaning, the segmentation process focuses on them. The process starts by standardizing the text, that is, correcting inconsistencies in spaces, symbols, and common punctuation usage by consolidating consecutive spaces, newlines, and punctuation marks. This step forms the foundation for the subsequent processing stages and is carried out with the Text Analytics Toolbox™. Pre-processing ends with the replacement of special entities in the texts. Special entities are parts of the text whose presence can reflect a specific concept: numbers, times, email addresses, currency amounts, and web addresses. Each such entity is replaced by a keyword: numbers and times are replaced by NumberKey and TimeKey, respectively, and email addresses, currency amounts, and web addresses are replaced by emailKey, currencyKey, and webKey, respectively.
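The pre-processing steps can be sketched in Python as follows. This is only an approximation of the idea: the paper uses the Text Analytics Toolbox™, whereas the regular expressions, the function name `preprocess`, and the naive sentence splitter below are assumptions made for illustration. Only the replacement keywords come from the text.

```python
import re

# Rough entity patterns (assumptions); order matters, so that e-mail and web
# addresses are replaced before the digits inside them can match NumberKey.
ENTITY_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "emailKey"),
    (re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE), "webKey"),
    (re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*"), "currencyKey"),
    (re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"), "TimeKey"),
    (re.compile(r"\b\d+(?:[.,]\d+)*\b"), "NumberKey"),
]

def preprocess(text):
    """Standardize spacing and punctuation, replace entities, split sentences."""
    # Consolidate consecutive spaces and newlines into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Consolidate runs of the same punctuation mark ("!!" becomes "!").
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)
    # Replace each special entity with its keyword.
    for pattern, keyword in ENTITY_PATTERNS:
        text = pattern.sub(keyword, text)
    # Naive sentence segmentation on terminal punctuation followed by a space.
    return re.split(r"(?<=[.!?])\s+", text)

sentences = preprocess(
    "Mail  bob@example.com at 10:30!!  It costs $25.50,\n"
    "see www.example.com for 3 items."
)
```

The web-address pattern is deliberately simple and would also swallow punctuation attached directly to a URL, so a production system would need more careful rules.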
Representing the features of texts
In the second phase of the presented approach, three different techniques are used to represent the features of the pre-processed texts: “statistical features”, the “TF-IDF technique”, and the “word2vec technique”. These methods are combined to obtain a more complete feature description that covers the diversity of people’s writing patterns. Each technique describes the content features of the texts separately, as explained in the rest of this section.
Extracting the statistical features of the text
The first set of features used in the proposed method for describing texts is statistical features. For this purpose, the sentences identified in the texts are processed, and the grammatical role of the words in the text is identified. After determining the grammatical role of the words for all the sentences in the text, a labeled text document will be obtained, using which a set of statistical features will be extracted to describe the features of the texts. The set of statistical features describing each text is:
Average length of each sentence in terms of words: In order to calculate this feature for each text document, first, the number of words in the text document is counted, and this number is divided by the total number of sentences in the text.
Average number of ineffective words (stop words) in each sentence: To calculate this feature, the total number of ineffective words in a text document is divided by the number of sentences in it.
Frequency rate of each grammatical role in the text: this feature is calculated separately for each possible grammatical role, such as noun phrase, present phrase, adverb, adjective, etc. In this case, for each grammatical role, such as x, the number of phrases that have the grammatical role x in the text is calculated and divided by the total number of phrases in the text.
The rate of using punctuation marks in the text: in this feature, the number of punctuation characters in the text is divided by the number of sentences.
Average length of each word: This feature gives the average number of characters per word in the input text.
By extracting the set of the above features, a numerical vector representing the statistical features of the text is formed for each text document. These features are used by one of the CNN components in the proposed ensemble learning system.
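A minimal sketch of how these statistical features could be computed is given below. The stop-word list is a small illustrative set, the word pattern is a simplification, and the grammatical role tags are assumed to come from an external tagger; the function name `statistical_features` and the tag names are hypothetical.

```python
import re
import string
from collections import Counter

# Small illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on",
              "as", "into", "because", "is", "it"}

def statistical_features(sentences, role_tags):
    """Compute the statistical features of one document.

    sentences: list of sentence strings from the pre-processing step.
    role_tags: flat list of grammatical role tags, one per phrase,
               produced by an external tagger (not shown here).
    """
    words = [w for s in sentences
             for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", s)]
    n_sentences = len(sentences)
    n_stop = sum(w.lower() in STOP_WORDS for w in words)
    n_punct = sum(ch in string.punctuation for s in sentences for ch in s)
    features = {
        "avg_sentence_length": len(words) / n_sentences,
        "avg_stop_words": n_stop / n_sentences,
        "punctuation_rate": n_punct / n_sentences,
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }
    # Frequency rate of each grammatical role among all tagged phrases.
    for tag, count in Counter(role_tags).items():
        features["role_rate_" + tag] = count / len(role_tags)
    return features

features = statistical_features(
    ["The cat sat on the mat.", "It purred, loudly!"],
    ["NOUN", "VERB", "NOUN", "ADV"],
)
```

The resulting dictionary can be flattened in a fixed key order to obtain the numerical vector that is passed to the first CNN component.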
Extracting TF-IDF features from keywords
The second category of descriptive text features is TF-IDF vectors, which describe the importance of words in the text based on their frequency. This strategy starts by extracting keywords from the text. First, ineffective words (stop words) are removed and the remaining words are reduced to their roots. Removing stop words filters out terms such as “as”, “into”, and “because”, which carry no special meaning in the text and whose high frequency may cause errors in keyword detection. In addition, a word can take different prefixes and suffixes, so two words with the same root may appear in different forms even though they convey the same meaning. To solve this problem, each non-stop word in the text is stemmed, which also reduces the dimension of the term vector, that is, the feature vector. By counting the occurrences of each unique stemmed word in a text document, the K most frequent root words are selected as the keywords of that document. After the keywords of every text are extracted, a list F = {w1, …, wn} of the unique keywords in the dataset is constructed. This list contains the keywords of the entire dataset and is used to describe every text document. In the next step, TF-IDF is used to describe each document as a vector. In this mechanism, the TF-IDF weight of term w is calculated as36:
$$W_{w} = TF \times \log\left(\frac{N}{N_{w}}\right) \qquad (1)$$
In (1), TF refers to the number of occurrences of w in the document, N is the number of documents in the dataset, and Nw is the number of documents containing w. In the presented approach, the weight obtained for each keyword in each document is stored in a feature vector W. The resulting TF-IDF vectors are used to train the second CNN component of the proposed aggregation system.
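The keyword selection and weighting can be sketched as follows. The sketch assumes the common form of the weight, TF multiplied by log(N / Nw) with a natural logarithm, and assumes that documents arrive as lists of already stemmed, stop-word-free tokens; the function names are hypothetical.

```python
import math
from collections import Counter

def top_k_keywords(tokens, k):
    """Return the k most frequent stemmed tokens of one document."""
    return [word for word, _ in Counter(tokens).most_common(k)]

def tfidf_vectors(documents, k):
    """Build the keyword list F and one TF-IDF vector per document.

    documents: list of token lists (stop words removed, tokens stemmed).
    Assumes weight = TF * ln(N / Nw); the base of the logarithm is a guess.
    """
    vocabulary = sorted({w for doc in documents for w in top_k_keywords(doc, k)})
    n_docs = len(documents)
    doc_freq = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[w] * math.log(n_docs / doc_freq[w])
                        for w in vocabulary])
    return vocabulary, vectors

docs = [
    ["style", "author", "style"],
    ["author", "text"],
    ["text", "style", "text"],
]
vocabulary, vectors = tfidf_vectors(docs, k=2)
```

With this weighting, a keyword that occurs in every document receives a weight of zero, which is why frequent but uninformative terms contribute little to the vector.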
Extraction of word2vec features
The third category of features is word2vec vectors, which describe the content features of texts. Word2Vec is a language model that learns the meaning of words through an automatic learning process and converts words into vectors that capture their semantic features. To convert text to vectors, the proposed method uses the Skip-Gram technique, in which the model tries to predict the words surrounding a given word. The model assigns each word a vector that contains its semantic features. The steps of feature description using Word2Vec are as follows37:
Training the Word2Vec model
Convert text to vector using the Word2Vec model
In the proposed method, the Text Analytics Toolbox™ and a fastText-trained model are used. The output of this model is a vector for each text document, which is used as the input of the third CNN component in the proposed ensemble learning system.
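To make the Skip-Gram idea concrete, the sketch below generates the (center word, context word) training pairs that such a model learns from, and then turns word vectors into a single document vector. Averaging the word vectors is an assumption, since the text does not state how the per-document vector is formed; the window size, the function names, and the toy embedding are likewise hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """List the (center, context) pairs used to train a Skip-Gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def document_vector(tokens, embedding):
    """Average the vectors of all tokens found in the embedding.

    Assumption: mean pooling; unknown tokens are skipped.
    """
    known = [embedding[t] for t in tokens if t in embedding]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

pairs = skipgram_pairs(["a", "b", "c"], window=1)
toy_embedding = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # stand-in for a trained model
doc_vec = document_vector(["a", "b", "z"], toy_embedding)
```

In practice the embedding would be loaded from the pre-trained model instead of being defined by hand, and its vectors would have far more than two dimensions.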
Classification of features and recognition of the author
After the statistical, TF-IDF, and word2vec features of the texts are described, each feature set is used by one of the CNN models of the proposed ensemble learning system to identify the writing patterns of each author. The system then integrates the local detection results of the CNN components so that the weighted combination of their outputs achieves a more accurate identification. In the following, the structure of the learning components in the presented ensemble learning system is described first, followed by the method of aggregating the results and identifying the author.
CNN components structure in the presented ensemble learning system
After converting the text into formats that can be processed by convolutional neural networks, each of these formats is used as input to one of the three CNN models of the proposed aggregation system. The first CNN model uses text statistical features, the second model uses TF-IDF features, and the third model uses word2vec features. The CNN models used in the proposed system are similar in terms of the number and order of layers, but the configuration of the parameters of each layer is different. The structure of the layers of each of these models is shown in Fig. 2.
Fig. 2 [Images not available. See PDF.]
Architecture of proposed CNN models.
Each CNN model consists of three one-dimensional convolution blocks. In each block, after the feature maps are extracted by the convolution layer, an activation function passes the features on to the next layer. Each convolution block ends with a pooling layer that reduces the dimensionality of the feature maps. At the end of each CNN model, a fully connected layer and a classification layer are used to determine the target category. The BayesOpt tool is used to configure the parameters of each CNN model, with the aim of selecting the structure that yields the lowest error on the training data. The resulting configuration of the CNN models is given in Table 2.
Table 2. The configuration determined for each layer of the three CNN models used in the proposed ensemble learning system (Dim: filter or pooling window dimension; N: number of filters).
Layer | Statistical Features | TF-IDF Features | Word2vec Features |
|---|---|---|---|
Convolution 1 (Dim, N) | (5, 8) | (15, 12) | (20, 18) |
Activation 1 | Sigmoid | ReLU | ReLU |
Pooling 1 (Dim) | Average (2) | Max (4) | Max (6) |
Convolution 2 (Dim, N) | (4, 12) | (9, 24) | (13, 48) |
Activation 2 | ReLU | ReLU | ReLU |
Pooling 2 (Dim) | Average (2) | Max (2) | Max (4) |
Convolution 3 (Dim, N) | (3, 16) | (7, 32) | (5, 64) |
Activation 3 | Sigmoid | Sigmoid | Sigmoid |
Pooling 3 (Dim) | Average (2) | Max (2) | Max (2) |
Based on the configurations presented in Table 2, in general, as the depth of the model layers increases, the dimensions of the convolution filters decrease, and the number of filters necessary for feature extraction in each layer increases. After training these three CNN models on the training data of the database, the weighted aggregation strategy is used to determine the final output of the system and recognize the author.
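To make the configurations in Table 2 more concrete, the following minimal Python sketch traces how the feature-map length shrinks through the three convolution blocks of each model. It assumes unpadded ("valid") convolutions with stride 1 and non-overlapping pooling windows, neither of which is stated in the text. The input lengths (100, 200, 300) and the names `CONFIGS`, `block_output_length`, and `trace_shapes` are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
# (filter length, number of filters, pooling window) per block, copied from Table 2.
CONFIGS = {
    "statistical": [(5, 8, 2), (4, 12, 2), (3, 16, 2)],
    "tfidf": [(15, 12, 4), (9, 24, 2), (7, 32, 2)],
    "word2vec": [(20, 18, 6), (13, 48, 4), (5, 64, 2)],
}


def block_output_length(length, kernel, pool):
    """Length after one conv + pooling block (assumed: no padding, stride 1)."""
    conv_len = length - kernel + 1      # unpadded 1-D convolution
    if conv_len < pool:
        raise ValueError("input too short for this block")
    return conv_len // pool             # non-overlapping pooling windows


def trace_shapes(input_length, blocks):
    """Return (feature-map length, number of filters) after each block."""
    shapes = []
    length = input_length
    for kernel, n_filters, pool in blocks:
        length = block_output_length(length, kernel, pool)
        shapes.append((length, n_filters))
    return shapes


# Hypothetical input lengths, chosen only to exercise the arithmetic.
print(trace_shapes(100, CONFIGS["statistical"]))  # [(48, 8), (22, 12), (10, 16)]
print(trace_shapes(200, CONFIGS["tfidf"]))
print(trace_shapes(300, CONFIGS["word2vec"]))
```

Under these assumptions the feature maps become shorter while the number of filters grows with depth, which matches the trend described for Table 2.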
Aggregation of results and identifying the author
Identifying the author with ensemble learning strategies involves two basic challenges. First, author identification in most application scenarios involves multiple target classes; to use ensemble learning in this setting, the number of classifiers must be increased so that they can reach a consensus on the sample labels, which significantly increases the computational load. Second, because the target categories are highly similar (some authors follow similar writing styles), the classification models in an ensemble learning system can differ significantly in performance. These two properties make basic strategies such as majority voting unsuitable for this problem. For this reason, instead of a classic aggregation strategy such as majority voting, the proposed method uses an ML model to combine the outputs of the DL models. This strategy makes the ensemble more flexible in combining the local decisions of the individual classifiers and removes the need to increase the number of classifiers.
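The first challenge can be illustrated with a small Python sketch. With only three classifiers and many candidate authors, the three votes can all differ, so plain majority voting has no winner. The function name `majority_vote` and the author labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the winning label, or None when the top vote count is tied."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the vote is undecided
    return counts[0][0]


# Three classifiers, thirty possible authors: each model picks a different one.
print(majority_vote([3, 17, 25]))  # None
# Two of the three models agree, so a winner exists.
print(majority_vote([3, 3, 25]))   # 3
```

A learned aggregation step avoids this dead end because it works on the full per-class outputs of each model rather than on a single hard vote per model.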
In the proposed aggregation strategy, the output of each CNN model is described by two vectors: the fully connected (FC) weight vector and the per-class training accuracy (EC) vector. Assuming C target classes (authors), each of these vectors has length C. The FC vector obtained from each CNN model contains the output weight of the input sample for each target class, so its values describe the output of the CNN model for that sample. The EC vector, on the other hand, shows the training accuracy of the CNN model for each target class. In other words, the ith element of this vector is the accuracy of the model in the training phase for class i samples. This criterion can be calculated for target class i as follows:
$$EC_i = \frac{TP_i}{TP_i + FP_i} \qquad (2)$$
In (2), TP_i is the number of training instances belonging to class i for which the output of the CNN model is correct, and FP_i is the number of training samples that actually belong to other target categories but were assigned to category i by the CNN model. Rather than being simply concatenated, the outputs of the three CNN models are merged through a self-attention mechanism that combines the respective feature representations. For each CNN model, the FC vector (the classification weights per class) and the EC vector (the training accuracy per class) are first concatenated into a single representation vector for that CNN. The three resulting representation vectors (one per CNN) are then passed into the self-attention layer. The scaled dot-product attention used has the general form:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
In (3), Q, K, and V are the Query, Key, and Value matrices, respectively, derived from the input representation vectors of the CNNs, and $d_k$ is the dimension of the keys. This self-attention mechanism can dynamically learn the relative significance of each CNN output for a specific input text and thus allows a more flexible combination of the various stylistic features. Specifically, the self-attention layer computes attention weights for the output vector of each CNN according to its relevance to the task. The resulting weighted sum is a discriminative, fused feature vector that captures the most salient information in all three feature modalities (statistical, TF-IDF, and Word2Vec). This dynamically weighted vector, which summarizes both the decisions and the training performance of the models, is then fed into a SoftMax classifier to determine the final author of the input text.
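The aggregation steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the EC value is computed as TP / (TP + FP) from the definitions given for (2), the attention uses Q = K = V equal to the input vectors (no learned projection matrices), and the three attended vectors are averaged into one fused vector because the text does not specify this reduction. All function names and the toy values are hypothetical.

```python
import math


def per_class_ec(y_true, y_pred, n_classes):
    """EC_i = TP_i / (TP_i + FP_i), computed on training labels."""
    ec = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        ec.append(tp / (tp + fp) if tp + fp else 0.0)
    return ec


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product attention with Q = K = V = X (assumption)."""
    d_k = len(X[0])
    outputs, weights = [], []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * X[j][d] for j in range(len(X)))
                        for d in range(d_k)])
    return outputs, weights


def fuse(fc_vectors, ec_vectors):
    """Concatenate FC and EC per CNN, attend, then average (assumed reduction)."""
    X = [fc + ec for fc, ec in zip(fc_vectors, ec_vectors)]
    attended, weights = self_attention(X)
    n = len(attended)
    fused = [sum(row[d] for row in attended) / n for d in range(len(X[0]))]
    return fused, weights


# Toy example with C = 2 authors and three CNN branches.
fc = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]   # hypothetical FC outputs
ec = [[1.0, 0.5], [0.8, 0.7], [0.6, 0.9]]   # hypothetical per-class EC values
fused_vector, attention_weights = fuse(fc, ec)
print(fused_vector)
```

In the described system the fused vector would then be passed to a trained SoftMax classifier; that final learned layer is omitted here.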
Results and discussion
In this section, the research results are reviewed and analyzed. Thus, after describing the experimental details and presenting the results, the research findings are discussed and statistically analyzed.
Implementation details
The proposed method was implemented in MATLAB 2020a. The k-fold cross-validation method (k = 10) was used to assess how well the proposed method generalizes on datasets A and B. In this scenario, the experiment was repeated 10 times, and in each iteration 90% of the data was used for training and 10% for testing. The evaluation indices used to check the performance of the models are Accuracy, Precision, Recall, and F-Measure. During the experiments, the performance of the proposed approach was compared with four different methods: the Alhuqail6, Stoean et al.17, Tang21, and Win33 models. The models introduced in6,17,21,33 are author identification approaches reviewed in Sect. “Introduction”. Additionally, to demonstrate the effectiveness of the self-attention mechanism in fusing the CNN features, this layer was replaced with a conventional concatenation operator and the results were evaluated. This scenario is referred to as “concat (non-ATM)” in this section. As mentioned before, two different datasets are used to evaluate the performance of the proposed author identification model, so the results are presented separately for each dataset.
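The 10-fold protocol described above can be sketched in Python as follows. The original experiments were run in MATLAB, so this is only an illustration of the splitting logic; the function name `k_fold_indices`, the fixed seed, and the unstratified shuffling are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 100 samples, every iteration trains on 90 and tests on 10.
for train_idx, test_idx in k_fold_indices(100):
    assert len(train_idx) == 90 and len(test_idx) == 10
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every sample for testing once.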
Results
The experiments were performed using tenfold cross-validation. Figure 3 shows the average Accuracy values of the proposed method compared to the other methods. According to this figure, the proposed model has the highest Accuracy on both datasets and has therefore identified authors more accurately than the others. Among the baselines, the Alhuqail6, Stoean et al.17, Tang21, and Win33 models follow in that order on both datasets (Stoean et al.17 and Tang21 tie in Accuracy on Set A). The average Accuracy of the proposed model is 80.29% on Set A and 78.44% on Set B, which is an improvement of 2.61 and 2.56 percentage points, respectively, over the simple concatenation case “concat (non-ATM)”. This indicates that using the self-attention mechanism to fuse the extracted features can improve the overall accuracy of the identification model.
Fig. 3. Comparing the models based on the average Accuracy index.
Figure 4 shows the bar plots related to the performance of the models based on the Precision, Recall, and F-Measure indices.
Fig. 4. Comparing the models based on Precision, Recall, and F-Measure indices.
According to Fig. 4, the proposed model has the highest Precision, Recall, and F-Measure values on both datasets and therefore the best classification quality among the compared models. Among the baselines, the Alhuqail6, Stoean et al.17, Tang21, and Win33 models follow in that order. The non-ATM case also has lower values than the proposed attention-based approach. As a result, employing the attention mechanism has led to a multi-modal classification model of higher quality.
Figure 5 shows the confusion matrices for all models on Set A. In each matrix, the rows represent the output class and the columns represent the target class. According to this figure, the proposed model performed best at matching the predicted authors to the true labels: it correctly identified the author for 80.29% of the samples, whereas the Alhuqail6 method, the closest of the four baseline methods, correctly classified only 77.20%.
Fig. 5. The confusion matrix related to all models based on Set A.
Figure 6 shows the confusion matrices for all models on Set B. According to this figure, the proposed method also performed better than the others in identifying authors on Set B: it correctly identified the author for 78.44% of the samples, whereas 75.89% of the samples were correctly classified when the proposed attention-based fusion mechanism was replaced with simple concatenation.
Fig. 6. The confusion matrix related to all models based on Set B.
Next, as shown in Fig. 7, the performance of the models in classifying the positive samples of each author was compared using receiver operating characteristic (ROC) curves. In this chart, the vertical axis corresponds to the true positive rate (TPR) and the horizontal axis to the false positive rate (FPR). The point with coordinates (0, 1) has the greatest TPR and the lowest FPR, so it represents ideal classification; the closer a curve comes to this point, the better the classifier.
Fig. 7. Performance of all models based on ROC curves.
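As a small illustration of the two ROC axes, the following Python sketch computes one (FPR, TPR) point for a single author in a one-vs-rest fashion from hard predictions. A full ROC curve would additionally require class scores and a threshold sweep, which are not shown here. The function name `one_vs_rest_rates` and the toy labels are hypothetical.

```python
def one_vs_rest_rates(y_true, y_pred, author):
    """Return (FPR, TPR) when `author` is treated as the positive class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == author and p == author:
            tp += 1          # text of this author, correctly attributed
        elif t == author:
            fn += 1          # text of this author, attributed to someone else
        elif p == author:
            fp += 1          # text of another author, attributed to this one
        else:
            tn += 1          # text of another author, not attributed to this one
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(one_vs_rest_rates(y_true, y_pred, author=0))  # (0.25, 0.5)
print(one_vs_rest_rates(y_true, y_true, author=0))  # (0.0, 1.0), the ideal point
```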
According to Fig. 7, on both datasets the curve of the proposed method lies above those of the other models. The ROC analysis therefore also confirms the superiority of the proposed model over the other algorithms. Among the baselines, the Alhuqail6, Stoean et al.17, Tang21, and Win33 models follow in that order. The proposed method reduced the FPR and increased the TPR at the same time on both datasets, which means that it performed the identification task more accurately for each individual author. A higher TPR means that the proposed method correctly identified a larger share of the texts written by each author, and a lower FPR means that, for each target author, fewer texts written by other authors were assigned to this author. Table 3 shows the values of the evaluation indices for all models on Set A and Set B.
Table 3. Performance of all models based on evaluation indices (values in %).
Dataset | Model | Precision | Recall | F-Measure | Accuracy |
|---|---|---|---|---|---|
Set A | Proposed | 80.1555 | 80.6873 | 80.2274 | 80.2850 |
Set A | Win33 | 72.6871 | 72.8980 | 72.7108 | 73.1591 |
Set A | Tang21 | 75.7105 | 75.9562 | 75.7410 | 76.0095 |
Set A | Stoean et al.17 | 75.9620 | 76.3695 | 75.9366 | 76.0095 |
Set A | Alhuqail6 | 76.7729 | 76.6478 | 76.6981 | 77.1971 |
Set A | Concat (non-ATM) | 77.5580 | 77.8496 | 77.5844 | 77.6722 |
Set B | Proposed | 78.8435 | 78.4444 | 78.3427 | 78.4444 |
Set B | Win33 | 66.4981 | 66.1111 | 66.1260 | 66.1111 |
Set B | Tang21 | 69.2214 | 68.8889 | 68.7087 | 68.8889 |
Set B | Stoean et al.17 | 72.7230 | 72.3333 | 72.2623 | 72.3333 |
Set B | Alhuqail6 | 74.6362 | 74.4444 | 74.3868 | 74.4444 |
Set B | Concat (non-ATM) | 76.4441 | 75.8889 | 75.8493 | 75.8889 |
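The four evaluation indices reported in Table 3 can be computed from predicted and true labels as in the following Python sketch. Macro-averaging over authors is an assumption (the text does not state the averaging scheme), as is taking the F-Measure as the mean of the per-class F-scores. The function name `macro_metrics` and the toy labels are hypothetical, and the values are returned as fractions rather than percentages.

```python
def macro_metrics(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall, F-measure, plus overall accuracy."""
    precisions, recalls, f_measures = [], [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_measures.append(f1)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f_measure": sum(f_measures) / n_classes,
        "accuracy": accuracy,
    }


# Toy example: three authors with two texts each.
print(macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3))
```

When every author has the same number of test texts, macro-averaged recall equals accuracy, which is the pattern seen in the Set B rows of Table 3.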
Statistical analysis
Statistical analysis can provide a more detailed study of the performance of the proposed method and of the significance of its differences from the baseline methods. For this purpose, one-way Analysis of Variance (ANOVA) was used. In this test, the predicted labels of the proposed model and the baseline methods are organized in a matrix in which each row corresponds to one of the test samples and each column corresponds to the proposed method or one of the baseline methods. Then, by comparing each output with the ground-truth label of the sample, an accuracy value is calculated for each entry of this matrix: an output that matches the ground-truth label is recorded as +1 and a non-matching output as −1. A normality test was carried out for each column of the accuracy matrix using a Quantile–Quantile diagram. The results of this test showed that the distribution of each column is normal, so one-way ANOVA can be used for the statistical analysis.
Performing the test resulted in a significant effect ( ), from which it can be concluded that the accuracies of at least two methods differ significantly ( ). Because one-way ANOVA cannot determine which models account for this difference, a multiple comparison test was used for deeper analysis. The results of this test are shown in Fig. 8: the proposed method shows a significant increase in average accuracy over the Win33, Tang21, and Stoean et al.17 methods, but it is not significantly different from the Alhuqail6 method in terms of accuracy.
Fig. 8. The result of the multiple comparison test on the accuracy of the models.
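The construction of the accuracy matrix and the one-way ANOVA F statistic described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB procedure: the function names and the toy predictions are hypothetical, and the p-value (which requires the F distribution) and the multiple comparison test are omitted.

```python
def correctness_matrix(y_true, predictions_by_model):
    """Rows: test samples; columns: models; +1 for a match, -1 otherwise."""
    return [[1 if preds[i] == t else -1 for preds in predictions_by_model]
            for i, t in enumerate(y_true)]


def one_way_anova_f(matrix):
    """Return (F, df_between, df_within), treating each column as one group."""
    n = len(matrix)          # samples per model
    k = len(matrix[0])       # number of models (groups)
    columns = [[row[j] for row in matrix] for j in range(k)]
    means = [sum(col) / n for col in columns]
    grand_mean = sum(sum(col) for col in columns) / (n * k)
    ss_between = sum(n * (m - grand_mean) ** 2 for m in means)
    ss_within = sum((x - m) ** 2 for col, m in zip(columns, means) for x in col)
    df_between, df_within = k - 1, n * k - k
    # Note: ss_within is zero if every model is all-correct or all-wrong.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


y_true = [0, 1, 2, 3]
model_a = [0, 1, 2, 0]       # three of four correct
model_b = [0, 0, 0, 0]       # one of four correct
matrix = correctness_matrix(y_true, [model_a, model_b])
print(one_way_anova_f(matrix))  # (2.0, 1, 6)
```

A larger F value indicates that the differences between the mean accuracies of the models are large relative to the variation within each model's column.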
Limitations and future directions
While the proposed multi-feature ensemble approach demonstrates significant improvements in authorship attribution accuracy on the evaluated datasets, the current study has certain limitations that open avenues for valuable future research. These aspects will help increase the stability and practicality and knowledge base of deep learning models that operate in this domain. The main limitations of this research are as follows:
Dataset scope and generalizability: The performance assessment in this work used two particular datasets, Set A and Set B. These datasets enabled controlled experiments and comparisons with state-of-the-art methods, but they are limited in size, in the number of authors (especially Set A), in genre variety (mainly literary texts), and in language (only English). The applicability of the model to other contexts needs additional testing before it can be used with diverse author pools and different genres, including emails, social media, technical reports, historical documents, and multilingual texts. The model performance may depend on particular characteristics of the training and testing datasets.
Model interpretability: The current configuration with multiple CNNs achieves high accuracy, but its internal decision process remains difficult to interpret. The model has no built-in mechanism for showing which stylistic features or text sections lead to an authorship decision. This lack of explanation is a barrier for applications in which the reasoning behind a decision must be understood.
Computational cost: The computational expense of ensemble methods becomes particularly high when several deep neural networks such as CNNs are used together. Training requires substantial time and computing resources, and inference can be difficult in real-time applications or on resource-limited devices.
In view of the observed results and these limitations, future research should focus on the following directions:
Enhancing generalizability through diverse datasets: Future research must evaluate the proposed model by applying it to extensive diverse datasets for improving its generalization capabilities. The model testing should be conducted across multiple genres as well as across different languages and various time periods to strengthen its robustness for real-world deployment.
Improving model explainability: The future development of this project must focus on improving model explainability using Explainable AI (XAI) techniques. Future research must examine the combination of SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) analytical methods that will explain the model’s decision-making system. Investigating how the ensemble handles and combines feature importance information from each CNN helps reveal which stylistic differences the model has learned while enhancing its overall trustworthiness.
Optimizing computational efficiency: Practical deployment requires better computational efficiency. Studies should investigate model optimization through network pruning, parameter quantization, and knowledge distillation to develop lightweight but effective ensemble variants. Combining less resource-intensive deep learning architectures with structural optimization of the ensemble model is a promising avenue for research.
Exploring advanced models and features: Studies24,25 suggest that advanced neural language models (NLMs) such as BERT, ELECTRA, or RoBERTa could bring additional performance gains. Future work aims to integrate such models, either as feature extraction units or by training them within the framework, to capture a deeper representation of writing style. Performance could also improve through the inclusion of extra stylometric features, including syntactic structures and psycholinguistic markers.
Validating ensemble configuration: This research demonstrates the successful integration of three separate feature types using dedicated CNN networks, yet future assessments should evaluate different ensemble configurations, including the number and diversity of models. Validating the ensemble configuration through ablation studies and hyperparameter optimization would help reveal the setup that gives the best accuracy and efficiency for authorship attribution.
By pursuing these directions, we seek to create authorship attribution systems that are both accurate and applicable to a wide range of real-life situations.
Conclusion
Author identification is an important task in many fields, and automatic author identification is an important subfield of it. It can assist in identifying cases of plagiarism by comparing the writing style of a text to the works of other authors. In situations where the creator of a literary work is unknown, these models can help to reveal the actual author. It is also useful in forensic analysis, where the writing style can be used to determine the author of an anonymous message or letter. This work presented a new method for identifying authors that has several advantages over existing techniques, such as a better representation of the attributes related to writing style through a combination of various features and improved accuracy through an ensemble of CNNs. The average accuracies of 80.29% on Set A and 78.44% on Set B, with a minimum improvement of 3.09% over the baseline methods, indicate that the proposed method performs better than the baseline methods for identifying the authors of literary works. The ROC curve analysis further supported the effectiveness of the proposed model in achieving a higher TPR in identifying the texts of each author while maintaining a low FPR. Finally, the one-way ANOVA test rejects the null hypothesis of equal accuracy, and the multiple comparison test indicates a significant difference in accuracy between the proposed model and the Win33, Tang21, and Stoean et al.17 methods.
Author contributions
Yuan Zhang wrote the main manuscript text. Yuan Zhang reviewed the manuscript.
Data availability
The datasets generated during and/or analysed during the current study are available in the Zenodo repository, https://doi.org/10.5281/zenodo.17507121.
Declarations
Competing interests
The authors declare no competing interests.
Abbreviations
BERT: Bidirectional encoder representations from transformers
LSTM: Long short-term memory
BOW: Bag of words
ML: Machine learning
CNN: Convolutional neural network
NLP: Natural language processing
DL: Deep learning
ROC: Receiver operating characteristic
FPR: False positive rate
SGD: Stochastic gradient descent
GRU: Gated recurrent unit
SVM: Support vector machine
LR: Logistic regression
TF-IDF: Term frequency-inverse document frequency
LSA: Latent semantic analysis
TPR: True positive rate
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Rehman, A; Naz, S; Razzak, MI; Hameed, IA. Automatic visual features for writer identification: A deep learning approach. IEEE access; 2019; 7, pp. 17149-17157. [DOI: https://dx.doi.org/10.1109/ACCESS.2018.2890810]
2. T. C. Mendenhall, "The characteristic curves of composition," Science, pp. 237–246, 1887.
3. Adams, J; Williams, H; Carter, J; Dozier, G. Genetic Heuristic Development: Feature selection for author identification. IEEE Symposium Comput. Intell. Biomet. Identity Manag. (CIBIM); 2013; pp. 36-41.
4. N. Tarmizi, S. Saee, and D. H. A. Ibrahim, "Author Identification: Performance Comparison using English and Under-Resourced Languages," in Journal of Physics: Conference Series. 052057 2020.
5. Vijayakumar, B; Fuad, MMM. A new method to identify short-text authors using combinations of machine learning and natural language processing techniques. Procedia Comput. Sci.; 2019; 159, pp. 428-436. [DOI: https://dx.doi.org/10.1016/j.procs.2019.09.197]
6. Alhuqail, NK. Author identification based on nlp. Eur. J. Comput. Sci. Info. Tech.; 2021; 9, pp. 1-26.
7. Semma, A; Hannad, Y; Siddiqi, I; Djeddi, C; El Kettani, MEY. Writer identification using deep learning with fast keypoints and harris corner detector. Expert Syst. Appl.; 2021; 184, [DOI: https://dx.doi.org/10.1016/j.eswa.2021.115473] 115473.
8. Pavelec, D; Justino, E; Oliveira, LS. "Author identification using stylometric features," Inteligencia Artificial. Revista Iberoamericana de Inteligencia Artificial; 2007; 11, pp. 59-65.
9. S. F. Pratama, A. K. Muda, and Y.-H. Choo, "Feature selection methods for writer identification: A comparative study," in Proceedings of 234–239 2010.
10. Chaski, CE. Empirical evaluations of language-based author identification techniques. Forensic linguistics; 2001; 8, pp. 1-65.
11. Iqbal, MM; Raza, A; Aslam, MM; Farhan, M; Yaseen, S. A stylometric fingerprinting method for author identification using machine learning. Tech. J.; 2023; 28, pp. 28-35.
12. S. Nirkhi and R. V. Dharaskar, "Comparative study of authorship identification techniques for cyber forensics analysis," arXiv preprint arXiv:1401.6118, 2013.
13. J. Houvardas and E. Stamatatos, "N-gram feature selection for authorship identification," in International conference on artificial intelligence: Methodology, systems, and applications. 77–86 2006.
14. C. Qian, T. He, and R. Zhang, "Deep learning based authorship identification," Report, Stanford University. 1–9, 2017.
15. N. E. Benzebouchi, N. Azizi, N. E. Hammami, D. Schwab, M. C. E. Khelaifia, and M. Aldwairi, "Authors’ Writing Styles Based Authorship Identification System Using the Text Representation Vector," in 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD). 371–376 2019.
16. W. Ma, R. Liu, L. Wang, and S. Vosoughi, "Towards improved model design for authorship identification: A survey on writing style understanding," arXiv preprint arXiv:2009.14445, 2020.
17. Stoean, C; Lichtblau, D. Author identification using chaos game representation and deep learning. Mathematics; 2020; 8, 1933. [DOI: https://dx.doi.org/10.3390/math8111933]
18. Saedi, C; Dras, M. Siamese networks for large-scale author identification. Comput. Speech Language.; 2021; 70, 101241. [DOI: https://dx.doi.org/10.1016/j.csl.2021.101241]
19. Yülüce, İ; Dalkiliç, F. Author identification with machine learning algorithms. Int. J. Multidisciplinary Stud. Innov. Technol.; 2022; 6, pp. 45-50.
20. Abbasi, A; Javed, AR; Iqbal, F; Jalil, Z; Gadekallu, TR; Kryvinska, N. Authorship identification using ensemble learning. Sci. Rep.; 2022; 12, 9537. [DOI: https://dx.doi.org/10.1038/s41598-022-13690-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35680983] [PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9184563]
21. Tang, X. Author identification of literary works based on text analysis and deep learning. Heliyon.; 2024; 10, e25464. [DOI: https://dx.doi.org/10.1016/j.heliyon.2024.e25464] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38327475][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10848006]
22. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.; 2006; 7, pp. 1-30.
23. Benavoli, A; Corani, G; Demšar, J; Zaffalon, M. Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. J. Mach. Learn. Res.; 2017; 18,
24. Huang, B., Chen, C., & Shu, K. (2024). Can Large Language Models Identify Authorship?. arXiv preprint arXiv:2403.08213.
25. Gargiulo, F; Minutolo, A; Guarasci, R; Damiano, E; De Pietro, G; Fujita, H; Esposito, M. An electra-based model for neural coreference resolution. IEEE Access; 2022; 10, pp. 75144-75157. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3189956]
26. Korbak, T., Shi, K., Chen, A., Bhalerao, R. V., Buckley, C., Phang, J., Perez, E. Pretraining language models with human preferences. In International Conference on Machine Learning (pp. 17506–17533). PMLR (2023, July).
27. Trotta, D., Guarasci, R., Leonardelli, E., & Tonelli, S. (2021). Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus. arXiv preprint arXiv:2109.12053.
28. Zhang, Z., Liu, Y., Huang, W., Mao, J., Wang, R., & Hu, H. (2023). MELA: Multilingual Evaluation of Linguistic Acceptability. arXiv preprint arXiv:2311.09033.
29. Wen, Z., Guo, D., & Zhang, H. (2024). AIDBench: A benchmark for evaluating the authorship identification capability of large language models. arXiv preprint arXiv:2411.13226.
30. Setzu, M; Corbara, S; Monreale, A; Moreo, A; Sebastiani, F. Explainable authorship identification in cultural heritage applications. ACM J. Comput. Cult. Herit.; 2025; 17,
31. Mächtle, F; Serr, JN; Loose, N; Sander, J; Eisenbarth, T. Ocean: Open-world contrastive authorship identification. International Conference on Applied Cryptography and Network Security; 2025; Springer: pp. 459-486. [DOI: https://dx.doi.org/10.1007/978-3-031-95764-2_18]
32. Gaviria de la Puerta, J; Pastor-López, I; Tellaeche, A; Sanz, B; Sanjurjo-González, H; Cuzzocrea, A; Bringas, PG. An innovative framework for supporting content-based authorship identification and analysis in social media networks. Logic J. IGPL; 2024; 32,
33. Win, T. S. Y. Authorship Identification System Using Word2Vec Word Embedding Model. In 2024 IEEE Conference on Computer Applications (ICCA) 1–9 IEEE (2024, March).
34. Boukhaled, MA. A Machine Learning based Study on Classical Arabic Authorship Identification. In ICAART; 2022; 1, pp. 489-495.
35. Abbas, S; Alsubai, S; Sampedro, GA; Abisado, M; Almadhor, AS; Kryvinska, N; Zaidi, MM. Active learning for news article’s authorship identification. IEEE Access; 2023; 11, pp. 98415-98426. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3310813]
36. Artama, M., Sukajaya, I. N., & Indrawan, G. Classification of official letters using TF-IDF method. In Journal of Physics: Conference Series 1516 1 012001. IOP Publishing (2020, April).
37. Di Gennaro, G; Buonanno, A; Palmieri, FA. Considerations about learning Word2Vec. J. Supercomput.; 2021; 77, pp. 1-16. [DOI: https://dx.doi.org/10.1007/s11227-021-03743-2]
corrected publication 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.