Abstract
As a significant natural language processing (NLP) task, Arabic text classification is essential for efficiently processing and analyzing Arabic-language content in various digital applications, such as information retrieval, sentiment analysis, and topic modeling. Deep learning architectures, such as convolutional neural networks (CNN) and long short-term memory (LSTM), have been widely utilized to categorize and organize language content accurately and to improve the autonomy and perception of NLP tasks. In this paper, we develop a hybrid deep learning framework for Arabic text classification that combines the Inception-CNN (introduced in the GoogleNet architecture) and the LSTM (a variation of the recurrent neural network). Specifically, the proposed system has been trained and evaluated on two Arabic news-article datasets, SANAD and NADiA. Several variations of the model architecture have been configured, trained, evaluated, and compared, with the aim of obtaining the best model architecture and hyperparameters. Our best experimental evaluation showed that the proposed hybrid system (Inception-CNN with LSTM) yielded accuracies of 92% and 96% for the Akhbarona and AlKhaleej datasets, respectively, while the entire SANAD dataset also yielded a high accuracy of 92%. Lastly, a comparison with state-of-the-art models revealed the superiority of our hybrid model, which outperformed the other architectures in the same area of study, improving accuracy by 1% to 30% across the different datasets.
Article Highlights
Proposing a model that combines the Inception module (CNN architecture) and LSTM for Arabic Text Classification
Research conducted on a low-resource language, Arabic, using the SANAD and NADiA datasets.
The proposed model has yielded an accuracy of 92% for SANAD and 89% for NADiA, which outperformed other compared architectures.
Introduction
Natural Language Processing (NLP) is a discipline that emerged decades ago alongside the Internet and the massive number of text documents that need to be searched, mined, and analyzed. According to [1], many tasks have been presented to process natural language. They start with the basic tasks of tokenization, parsing, stemming, lemmatization, and part-of-speech tagging, in addition to more complex tasks: classification, summarization, paraphrasing, question answering, machine translation, etc.
This need for text processing keeps rising as the types, sizes, and numbers of text documents rise, which challenges researchers to find the best way to accomplish such tasks. Another challenge that researchers face is dealing with the different languages on the Internet.
Text classification is one of the significant tasks that requires extensive research. It has many applications, such as classifying document topics and classifying tweets and reviews to discover public opinion, also known as sentiment analysis.
In research, text classification has been done using several approaches. Machine learning approaches, such as decision trees, support vector machines, random forests, naïve Bayes, and many others, have been used to accomplish this task. Deep learning approaches, though, are more useful for text classification. Artificial neural networks (ANN) and recurrent neural networks (RNN) with variations such as GRU and LSTM are all examples of deep learning algorithms used for such tasks; the authors of [2–9] are examples of researchers who have applied such algorithms.
According to the survey in [10], Large Language Models (LLMs) such as BERT, GPT-4, Llama 2, and Titan, in addition to AraBERT, which is trained on the Arabic language, are used in several NLP tasks, one of which is text classification. The authors highlight several advantages of using LLMs for this task, such as their scalability, fine-tuning capabilities, and ability to handle long texts, and they report clear improvements in accuracy. Nevertheless, LLMs come with certain major drawbacks. The high computational cost of training and fine-tuning LLMs makes them difficult to use in low-resource environments. There are also privacy issues and risks that stem from the possibility of sensitive data leakage. Moreover, the creators of common LLMs usually do not disclose the sources of the data used to train them, which may lead to bias and unfair results that cannot be spotted by the users of these products. Because of these limitations, LLMs have been excluded from our work, which concentrates on a simpler architecture (CNN-LSTM) that has proved efficient in terms of the computational power needed while achieving fairly high accuracies.
English, the most dominant online language, is frequently used for text classification, and datasets (corpora) for many NLP tasks are available in English. On the other hand, regarding the available datasets, Arabic is considered a low-resource language on the Internet.
Researchers such as [11–17] faced and described several challenges, besides the lack of proper datasets, when working on the Arabic language. Many tools have been created to process natural languages, but they have mostly been designed for English or English-like languages. Languages differ in several characteristics. When comparing English with Arabic, they differ in many points, for example: (A) the number of letters in the alphabet, (B) the shapes of the letters, (C) the syntactic order of subject, verb, and object in a sentence, (D) the direction of writing and reading, and (E) diacritization in Arabic, where diacritics are symbols added to the letters of a word that change its whole meaning even when the same combination of letters is used.
The authors of [18] have highlighted some challenges specific to Arabic text classification. Certainly, the low availability of Arabic datasets specialized for text classification is one of the most important challenges. Another major challenge facing Arabic text classifiers is the high dimensionality of the features extracted from Arabic text, which is caused by the rich morphology of the Arabic language. This richness also requires a more complex tokenization process as a pre-processing step before the text classification task. The comparatively large number of Arabic words with syntactic and semantic ambiguity is also a vital challenge for a classifier trying to extract the exact meaning, in addition to the diverse Arabic dialects that add extra terms to the language. The general challenges mentioned earlier are also discussed by the authors.
Word embedding approaches are tools created to change the representation of words into vectors so that they can be used as input to deep learning algorithms. This creates another challenge when working with the Arabic language: finding the word embedding approach that results in the best accuracy [12, 13]. In this paper, two Arabic news article datasets, SANAD and NADiA, proposed by [14], are used as the primary datasets for the experiments. A variety of deep learning algorithms is tested: CNN, LSTM, and combined models of CNN and LSTM; [19] presents a survey describing these and other deep learning algorithms. The target is to propose the best algorithm variation for Arabic text classification.
Text classification is a popular research topic among NLP tasks. Nevertheless, research on Arabic text classification still needs to be improved. Arabic is considered the sixth most spoken language worldwide, and Arabic content on the Internet is increasing rapidly. Therefore, it is necessary to find proper solutions for Arabic text classification.
Compared with other languages, research on Arabic text classification using deep learning approaches is even scarcer, as illustrated in the related work section. Even when such research is done, basic models are usually used, and few specialized models have been created for Arabic text classification that might better extract the language's special features.
This research presents a hybrid learning model combining the Inception-CNN and LSTM models to improve Arabic text classification. The model is trained and evaluated on two recent datasets for Arabic text classification, SANAD and NADiA. These two datasets consist of Arabic articles classified into seven topics: Culture, Finance, Medical, Politics, Religion, Sports, and Tech. While the integration of CNN and LSTM has been employed earlier for English text classification, the major portion of the research on Arabic text classification has concentrated on individual deep learning or machine learning models (i.e., CNN only, LSTM only, or other DL models). Only a few recent studies use a combination of CNN with RNN, CNN with GRU, or CNN with LSTM in their basic structure for Arabic text classification, with minor fine-tuning in the size or number of filters. While the Inception module has been employed once for English text classification, this is the first work that employs the Inception module (integrated with LSTM) for Arabic text classification. In this paper, several structures of CNN and LSTM are trained on the same datasets and then compared with the proposed model. Specifically, the main objectives of this research can be summarized as follows:
We develop a new hybrid framework integrating the Inception-CNN and LSTM for Arabic text classification (ATC).
We apply and contrast diverse structures of CNN and LSTM, aiming to attain the optimal model parameters.
We provide extensive experimental results to gain insight into the proposed classification model.
We compare our findings with state-of-the-art ATC solutions and show that our hybrid-based ATC is better than the prior art.
Related work
Text classification is considered one of the most tackled tasks in NLP, given the increasing research in this scope. It started with machine learning algorithms such as naïve Bayes, decision trees, support vector machines, and random forests, and moved on to deep learning algorithms such as artificial neural networks (ANN), RNN with its different variations LSTM and GRU, and, lately, CNN, which has become an interesting structure for such a task. This section concentrates on deep learning models since they are the most novel for text classification.
Text classification has been found helpful in many applications. Classification of sentiments in reviews or tweets is a well-known example. Classifying documents, such as articles, into their categories is another widely used application.
The English language accounts for the largest share of research in natural language processing in general and in text classification specifically. The Arabic language has also been tackled in such tasks, but much work is still needed to catch up with English.
This section reviews a few state-of-the-art models that approach text classification in English or any language other than Arabic. It will concentrate on recent papers since 2017 to present the most novel approaches to text classification in these languages. Then, the second part of this section reviews the literature on Arabic text classification, trying to highlight the gap in research between these languages.
Text classification for other languages
This subsection reviews the existing models that employ different learning schemes to improve the text classification results of languages other than Arabic. Table 1 summarizes the literature discussed in this subsection. For instance, the authors of [2] have combined the LSTM and CNN architectures as two successive layers to extract the features needed for text classification. They have used the state-of-the-art architecture of these two models, but with a changed filter size in the CNN. Their model starts with an LSTM that extracts essential features from the input text. Then, the output of the LSTM is fed into the CNN to highlight the most important features affecting the classification. Their model was trained on the IMDB (Internet movie review) dataset and yielded its highest accuracy, 91.17%, with a filter size of 5×600. However, they encountered an accuracy drop when training on longer sentences; thus, as future work, they suggested adding an attention mechanism layer, an issue addressed in the literature described next.
Similarly, [3] combined the CNN and LSTM architectures, starting with CNN as the second layer after the Word2Vec word embedding layer, then finalizing with the LSTM layer before applying the Softmax classifier for final classification. They have used a Chinese news corpus (Sogou.com) in their training. Then, they individually compared their model with traditional architectures used for text classification: KNN, SVM, CNN, and LSTM. Their hybrid model has achieved the best result, which is 90% accuracy.
Also, [20] have studied the sentiment in tweets gathered in a particular trend, which is #BlackLivesMatter. They have proposed a model that is constructed using CNN and LSTM models, and when testing it on the dataset, it yielded an accuracy of 94%.
In [4], the contributors changed the previous approaches by using a bidirectional LSTM instead of a forward-only LSTM. They argued that the bidirectional LSTM enables the model to scan the text forward and backward, which may add more understanding of the context so that better features can be extracted. They then added a convolutional layer that accepts, as input, a concatenation of the outputs of the bidirectional LSTM. Finally, a Softmax classifier is used at the end. They also used Word2Vec to represent the text for the initial word embedding layer. They compared their model individually with SVM, LSTM, and CNN, yielding an f-score of 0.99, which outperformed the other models.
In [21], the authors have also implemented their model on the IMDB dataset for sentiment analysis purposes. In their model, they started with GLOVE for word embedding, then a CNN layer for feature extraction, and finally, a multi-layer bidirectional LSTM. Their model has yielded 92% accuracy in the testing phase.
In another noticeable study, the researchers of [5] introduced two models: NA-CNN-LSTM and NA-CNN-COIF-LSTM. In NA-CNN-LSTM, they combined a CNN with an LSTM but with a simple modification: no activation function is applied in the CNN. In NA-CNN-COIF-LSTM, the same is done with the CNN, but another variation of LSTM is used, in which the input and forget gates are coupled so that the input gate is represented by 1-f (the forget gate). They trained several models on a subjective and objective dataset: CNN, LSTM, CNN+LSTM, the proposed NA-CNN-LSTM, CNN+COIF-LSTM (with the CNN activation function), and finally the proposed NA-CNN-COIF-LSTM. Comparing the results of these models, both proposed models yielded the highest f-score of 99.24%.
In [7], a new integration of three deep learning architectures, CNN, bidirectional LSTM, and Attention mechanism, has been developed and proposed. Their proposed model, AC-BiLSTM, starts with a convolutional layer to extract the essential features in a particular phrase. Then, a bidirectional LSTM is added to capture the preceding and succeeding context. Then, two attention mechanism layers are added to specify the weights of the most important words in this phrase. Finally, a softmax function is used for the final classification. They have argued that the presented architecture, consisting of the models above, has increased the accuracy of text classification. They have compared their model with different baseline models, such as SVM, CNN, LSTM, BiLSTM, and others. Several datasets have been used in this comparison; some are used for sentiment classification in movie reviews, and others for question classification. Table 1 illustrates the list of datasets they used. They have accomplished a higher accuracy in all comparisons. Nevertheless, according to the dataset used, the accuracy has ranged from 48.9% to 97% for the proposed model. This large range may be a limitation for their model since it can depend highly on the dataset. A similar architecture with one attention mechanism layer and Word2Vec as the word-embedding layer was proposed in the same context in [8]. The authors have only trained their model over the IMDB dataset and compared it with the individual models of CNN, BiLSTM, and MLP. Their hybrid model has yielded about 90% accuracy and outperformed the other models.
As for [6], they have argued that using a simple architecture of a bidirectional LSTM is enough to yield high text classification results by using several variations of cross-entropy loss, either by using labeled or unlabeled datasets. They have applied several variations of entropy loss: loss minimization, virtual adversarial, and adversarial.
Finally, [9] have proposed a model that consists of an Inception module and an LSTM. The Inception module was introduced by [22] in 2015 and used to create the famous GoogleNet, one of the deepest CNN networks at the time. Figure 1 illustrates the basic structure of this module. Its main idea was inspired by the Network in Network research of [23], which suggested that a more complex and deeper network yields more accurate results. The Inception module in GoogleNet was designed to train on images; thus, its convolution layers use 2D filters.
Fig. 1 [Images not available. See PDF.]
Inception module
In their work, [9] have used this Inception module as part of their proposed model. Their model starts with an embedding layer followed by an LSTM layer and then an Inception module. Since, in their model, the task was text classification, they have updated the filters in the Inception Module to become 1D. Their model has been trained on several datasets and gained high accuracy.
Their work has become the main inspiration for this research, using a similar model for Arabic text classification but with specific changes that will be described in detail in the Methodology section.
Table 1. Summary of literature on text classification (Language other than Arabic)
Ref# | Model | Contribution | Datasets | Results |
|---|---|---|---|---|
[2] | LSTM+CNN | Combining two models to achieve text classification task | IMDB (Movie review) | Accuracy of 91.17% when using CNN filter size (5X600) |
[21] | CNN + Multi-layer BiLSTM | Combining two models to achieve sentiment analysis task | IMDB | Accuracy of 92.02% |
[20] | CNN + LSTM | Combining two models to achieve sentiment and emotion detection task | Tweets collected from #BlackLivesMatter | Accuracy of 94% |
[3] | Hybrid CNN-LSTM | Combining two models LSTM and CNN | Chinese news corpus (Sogou.com) | Accuracy of 90% |
[4] | Bi-LSTM-CNN | Using of Bi-LSTM with CNN architecture | THUCNews (a subset) | 0.99 f-score |
[5] | NA-CNN-LSTM and NA-CNN-COIF-LSTM | Combining CNN with LSTM for text classification, but without using Activation function for CNN | Subjective and objective text data | F-Score of 99.24% |
[7] | AC-BiLSTM | Combined architecture of 1D CNN, bidirectional LSTM, and attention mechanism | MR, IMDB, RT-2K, SST-1, SST-2, Subj (Review datasets) TREC (question classification) | Accuracy ranged from 48.9 (SST-1) to 97 (TREC). |
[8] | Attention-based Bi-LSTM+CNN hybrid model | The hybrid architecture proved to be better than individual models | IMDB (Movie Reviews) | Accuracy of 90% |
[6] | BiLSTM with Cross Entropy loss | Using a simple architecture of BiLSTM followed by a max pooling layer. | ACL-IMDB (Sentiment classification) and AG-News (topic classification) | F-score reached 84.1 |
[9] | Inception Module + LSTM | Using a combination of the inception module and LSTM and followed by GlobalMaxPooling | MR, SST-1, SST-2, Subj, TREC | Accuracy ranged from 52.62% on SST-1 to 95.40% on TREC |
Arabic text classification
This subsection reviews the existing models that employ different learning schemes to improve the text classification results of the Arabic language. Table 2 summarizes the literature discussed in this subsection. For instance, a systematic review presented by [24] is summarized, highlighting the main issues to be considered when tackling the problem of Arabic Text Classification. Since this review was published recently in 2020, it has been considered the primary reference in this research for Arabic Text Classification resources.
In their review, [24] have started by explaining the main steps of this task: data pre-processing, text classification, and evaluation. They have stated the primary Arabic datasets that are used in Arabic research. The datasets differ; some are news documents collected from news channels, and others are books, tweets, hotel reviews, and Hadith (the sayings of Prophet Mohammad, peace be upon him). SANAD is a dataset proposed by [14], which has been collected from three different news websites: alarabiya.net, alkhaleej.ae, and akhbarona.com, and it has been selected as the main dataset for this research.
The authors of [24] have also classified the leading deep learning models used in research on Arabic Text Classification. They have listed CNN, LSTM, RNN, Feed Forward NN, MLP, and others; each has been individually implemented. One study they found used a combined model of CNN and RNN. They have also listed the most common evaluation metrics used for text classification in Arabic: accuracy, precision, recall, and f1-score, which are the standard metrics for most classification tasks.
In the same context, [24] studied the baseline models used for comparison in the reviewed research. These models are traditional ones such as SVM, Naïve Bayes, Decision Tree, Random Forest, KNN, and others.
To continue this section, several state-of-the-art models for Arabic Text Classification have been selected for discussion, starting with [14], who proposed two datasets, SANAD and NADiA, collected from different news sites. SANAD is a single-labeled dataset, while NADiA is a multi-labeled one. They started their model with Word2Vec word embedding, then trained their datasets on several architectures.
Authors of [12] have used CNN for Arabic text classification. They have trained their model on Arabic articles from three online newspapers. They used TF-IDF for word embedding, compared their model with Logistic Regression (LR) and Support Vector Machine (SVM), and yielded an accuracy of 92.94%.
Several researchers have used variations of RNN for Arabic text classification, such as [13, 15, 16–17]. [15] have proposed two models with RNN variations. One of them is character-level bi-LSTM, and the other is Aspect-based LSTM. Both models were applied to a sentiment analysis task using the Arabic Hotels’ Reviews dataset.
As for [16], they have proposed a model combining both LSTM and one of the RNN variations to classify Arabic tweet sentiments. They have used Word2Vec as the word embedding approach.
Besides, a combination of LSTM and CNN has been proposed by [13]. They have used GloVe as a word embedding approach and trained their model on the BRAD 2.0 dataset, which contains Arabic book reviews.
As for [17], they have worked on the SANAD dataset and trained it using a variety of deep learning architectures. They have used the bidirectional LSTM and GRU and combined both with CNN and a combination of HAN and GRU/LSTM. They compared the results from their experiments and found that the HANGRU model achieved the best accuracy.
In their study, the authors of [25] have focused on proposing robust classification systems for Arabic texts, considering various text embedding methods: Word2Vec, GloVe, FastText, and BERT. By training several deep learning architectures on the SANAD dataset, they have found that BERT outperformed the other word embedding techniques under study.
As for [26], they have proposed a semi-supervised machine learning approach, converting seven traditional supervised classifiers into semi-supervised ones to improve the robustness of the proposed classification system.
The work of [27] provides another example of Arabic text classification, where tweets are the basic type of text used to train the proposed model. They have proposed a BERT-based text classifier to classify Arabic tweets into certain categories. A pre-trained BERT model has been used to train several datasets, including SANAD and Arabic tweets. Several deep learning methods have also been evaluated and compared with BERT.
Table 2. Summary of Literature on Arabic Text Classification
Ref# | Model | Contribution | Datasets | Results |
|---|---|---|---|---|
[24] | – | A Systematic Review for Arabic Text Classification research | – | A useful reference for datasets, models, and evaluation metrics used in Arabic research. |
[14] | Several Architectures (CNN, LSTM, GRU, CNN+LSTM, CNN+GRU, and others) | Proposing two Arabic Datasets for text classification, in addition to training them on several deep learning architectures | SANAD, NADiA | GRU with attention mechanism has yielded the best accuracy for SANAD (96.94%) and NADiA (88.68%) |
[12] | TF-IDF for word embedding, and CNN for text classification | From the early papers that used CNN for text classification. | Articles from three Arabic online newspapers (Assabah, Hespress, and Akhbarona) | Accuracy peaked at 92.94%, compared against LR and SVM. |
[15] | Two variations of LSTM | Character-level bidirectional LSTM, and Aspect-based LSTM, for sentiment analysis | Arabic Hotels’ Review | F-score of 69.98, when using FastText as word embedding |
[16] | LSTM-RNN | Classification of Arabic tweets sentiments using Word2Vec for word embedding, and LSTM for classification | Arabic Tweets | Accuracy of 93.5% |
[13] | LSTM and CNN and GloVe for word embedding | Proposed a dataset BRAD 2.0, of Arabic Book reviews. Trained and tested using several architectures (LSTM, CNN with different filters, ensemble LSTM+CNN) | BRAD 2.0 | Best Accuracy 90.02% yielded by the ensemble model |
[17] | Several architectures (BiGRU, BiLSTM, CGRU, CLSTM, CNN, GRU, HANGRU, HANLSTM, LSTM) | Comparison between several deep learning architectures for Arabic Text Classification | SANAD | Top accuracy 95.81%, achieved by HANGRU. |
In this research, specific gaps have been covered. The target of the proposed model was to increase the classification accuracy and decrease the computation time of model creation as much as possible. The proposed model structure has not been used before in the Arabic-language literature. Nevertheless, it has been inspired by work on the English language introduced by [9], with changes applied to their model. Two datasets have been used for the experiments to corroborate each other's results, which has been achieved by yielding high accuracies with only a 2% difference.
Methodology
Dataset selection and preparation
Two datasets, SANAD and NADiA, are used in the experiments. They are Arabic datasets proposed by [14]. To start with SANAD, it has been collected from three news websites: AlKhaleej, AlArabiya, and Akhbarona. It contains 194,797 articles, classified into seven classes (Culture, Finance, Medical, Politics, Religion, Sports, and Tech). Articles in AlKhaleej and Akhbarona are classified into all seven classes, while articles in AlArabiya are only classified into six of them, excluding the Religion class. This dataset is considered single-labeled, where each article is assigned only one class. Table 3 summarizes the SANAD sub-datasets.
Table 3. Summary of the three datasets contained in SANAD
Dataset | #Articles | #Classes |
|---|---|---|
Akhbarona | 78,050 | 7 |
AlArabiya | 71,247 | 6 (excluding Religion class) |
AlKhaleej | 45,500 | 7 |
In this work, the three sub-datasets are experimented on separately at first; then they are gathered as one dataset, and the same experiments are conducted on it as well.
Each dataset has been organized into folders corresponding to the classes it represents: Akhbarona and AlKhaleej have seven folders each, and AlArabiya has six folders. Each folder is named after a class, and in each folder the articles are stored as (.txt) files. To prepare the data for processing, a simple Python program has been used to read each text file and generate two (.csv) files for each dataset, one for training and the other for testing. The created (.csv) files contain three columns: TextID, Text, and Class.
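The exact script is not provided with the paper; the following is a minimal sketch of this preparation step, assuming the folder layout described above (one sub-folder per class containing .txt files) and using hypothetical file names. The per-class training count shown follows the Akhbarona split in Table 4 below.

```python
import csv
from pathlib import Path

def read_dataset(dataset_dir: str):
    """Yield (TextID, Text, Class) for every .txt article stored under
    <dataset_dir>/<class_name>/, as described above."""
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            for txt_file in sorted(class_dir.glob("*.txt")):
                yield txt_file.stem, txt_file.read_text(encoding="utf-8"), class_dir.name

def write_csv(rows, out_csv: str):
    """Write rows to a CSV file with the three columns TextID, Text, and Class."""
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["TextID", "Text", "Class"])
        writer.writerows(rows)

# Hypothetical usage: keep a fixed number of articles per class for training
# (as described in the next paragraph) and leave the rest for testing.
TRAIN_PER_CLASS = 6000  # the per-class count used for Akhbarona (Table 4)
train_rows, test_rows, counts = [], [], {}
for row in read_dataset("Akhbarona"):
    cls = row[2]
    counts[cls] = counts.get(cls, 0) + 1
    (train_rows if counts[cls] <= TRAIN_PER_CLASS else test_rows).append(row)

write_csv(train_rows, "akhbarona_train.csv")
write_csv(test_rows, "akhbarona_test.csv")
```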
Each dataset has been divided into two parts: one part for model training and validation and the other for testing. Since the Akhbarona dataset is not originally balanced, it has been decided to select the same number of articles from each class for the first part (training and validation). This decision has been taken to guarantee that the training data is balanced, and thus better results should be achieved. Table 4 illustrates the number of articles included in both parts for each class. When applying the experiments, the train.csv dataset has been divided into 90% training and 10% validation. The same approach has been applied to the AlArabiya dataset, as illustrated in Table 5.
Table 4. Detailed description of the sub-dataset Akhbarona
Class | #Articles | Train and Validation | Testing |
|---|---|---|---|
Culture | 6746 | 6000 | 746 |
Finance | 9280 | 6000 | 3280 |
Medical | 12,947 | 6000 | 6947 |
Politics | 13,979 | 6000 | 7979 |
Religion | 7522 | 6000 | 1522 |
Sports | 15,755 | 6000 | 9755 |
Tech | 12,199 | 6000 | 6199 |
Total | 78,428 | 42,000 | 36,428 |
Table 5. Detailed description of the sub-dataset AlArabiya
Class | #Articles | Train and validation | Testing |
|---|---|---|---|
Culture | 5619 | 3500 | 2119 |
Finance | 30,076 | 3500 | 26,576 |
Medical | 3715 | 3500 | 215 |
Politics | 4368 | 3500 | 868 |
Sports | 23,058 | 3500 | 19,558 |
Tech | 4410 | 3500 | 910 |
Total | 71,246 | 21,000 | 50,246 |
As for the AlKhaleej dataset, it was originally balanced, with 6500 articles in each class. It has been decided to divide the data into 80% (training and validation) and 20% (testing). Then the training dataset is divided into 90% for training and 10% for validation. Table 6 illustrates the resulting numbers for each class.
Table 6. Detailed description of the sub-dataset AlKhaleej
Class | #Articles | Train and validation | Testing |
|---|---|---|---|
Culture | 6500 | 5300 | 1200 |
Finance | 6500 | 5300 | 1200 |
Medical | 6500 | 5300 | 1200 |
Politics | 6500 | 5300 | 1200 |
Religion | 6500 | 5300 | 1200 |
Sports | 6500 | 5300 | 1200 |
Tech | 6500 | 5300 | 1200 |
Total | 45,500 | 37,100 | 8400 |
Then, the three datasets are gathered into one large dataset and prepared to train the models described in the coming sections. Table 7 illustrates the training and testing dataset counts of SANAD.
Table 7. Detailed description of the dataset SANAD
Class | #articles | Train and Validation | Testing |
|---|---|---|---|
Culture | 18,860 | 14,796 | 4064 |
Finance | 45,812 | 14,792 | 31,020 |
Medical | 23,159 | 14,799 | 8360 |
Politics | 24,831 | 14,786 | 10,045 |
Religion | 13,979 | 11,269 | 2710 |
Sports | 45,319 | 14,807 | 30,512 |
Tech | 23,114 | 14,799 | 8315 |
Total | 195,074 | 100,048 | 95,026 |
As for the NADiA dataset [14], it has been designed as a multi-labeled dataset of 485,770 articles. Each article has been assigned several labels, both general and specific; for example, an article can be labeled both (News) and (Middle Eastern). Special Python code has been enclosed with the dataset by the authors, which can be used to generate a CSV file containing each article's text and labels. This code has been updated to consider the general labels only (Culture, Finance, Health, News, Religion, Sports, and Technology), so that each text has only one label, making it compatible with the SANAD dataset and enabling fair comparisons between them.
The NADiA dataset has been split in the same manner as SANAD to guarantee class balancing. Nevertheless, this split has been used in the first four experiments only; in the last two, an 80-20 split has been conducted for the training and testing datasets, and then a further 90-10 split of the training portion has been conducted to generate the validation dataset. Tables 8 and 9 illustrate this splitting. It can be noticed in Table 9 that the News class has not been split according to these ratios because of the huge number of articles it contains, which creates a large gap between this class and the others; thus, a fixed number of samples has been specified for training, and the rest are dedicated to testing.
Table 8. Detailed description of the dataset NADiA split in the first four experiments
Class | #articles | Train and validation | Testing |
|---|---|---|---|
Culture | 16,839 | 10,000 | 6839 |
Finance | 31,387 | 10,000 | 21,387 |
Health | 11,463 | 10,000 | 1463 |
News | 325,453 | 10,000 | 315,453 |
Religion | 18,126 | 10,000 | 8126 |
Sports | 61,383 | 10,000 | 51,383 |
Technology | 21,119 | 10,000 | 11,119 |
Total | 485,770 | 70,000 | 415,770 |
Table 9. Detailed description of the dataset NADiA split in the last two experiments
Class | #articles | Train and validation | Testing |
|---|---|---|---|
Culture | 16,839 | 13,471 | 3368 |
Finance | 31,387 | 25,110 | 6277 |
Health | 11,463 | 9170 | 2293 |
News | 325,453 | 50,000 | 275,453 |
Religion | 18,126 | 14,501 | 3625 |
Sports | 61,383 | 49,106 | 12,277 |
Technology | 21,119 | 16,895 | 4224 |
Total | 485,770 | 388,615 | 97,155 |
Data pre-processing
In this phase, traditional pre-processing steps have been applied. Since the language under study is Arabic, some of the conventional pre-processing steps, such as lowercasing words and expanding abbreviations, are unnecessary, since Arabic letters do not have lower and upper cases and abbreviations are not used in Arabic [28].
Other pre-processing steps suggested by [28], such as removing extra spaces, punctuation, stop words, and digits, are still needed. Also, diacritics may be present in Arabic documents; in this case, they should be removed as well.
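The paper does not include its cleaning code; the sketch below shows one way to implement these steps with regular expressions and the NLTK Arabic stop-word list, which is an assumption here rather than the authors' actual tooling.

```python
import re
import nltk

nltk.download("stopwords", quiet=True)
ARABIC_STOPWORDS = set(nltk.corpus.stopwords.words("arabic"))

# Arabic diacritics (tashkeel) occupy the Unicode range U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Keep only Arabic letters and whitespace: drops punctuation, digits, and Latin text.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

def preprocess(text: str) -> str:
    text = DIACRITICS.sub("", text)      # remove diacritics
    text = NON_ARABIC.sub(" ", text)     # remove punctuation, digits, and foreign characters
    tokens = [t for t in text.split() if t not in ARABIC_STOPWORDS]  # remove stop words
    return " ".join(tokens)              # also collapses extra whitespace
```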
Tokenization and word embedding
Text strings or characters cannot be used directly in deep learning models; they should be represented as numerical vectors using word embedding techniques. According to [29], several word embedding techniques are used, starting with simple one-hot encoding and ending with high-level methods such as Word2Vec, GloVe, and others. In this work, the Keras embedding layer has been used.
Nevertheless, an important step, tokenization, should be applied before word embedding; it is discussed by [30]. In word embedding, each word that appears in the dataset articles is represented as a numerical vector. Thus, each unique word (token) in these articles is inserted into a dictionary for the word embedding task. This dictionary should have a fixed size, namely the vocabulary size, which is usually estimated according to the dataset size. If the dictionary is full and there are still other words in the articles, all new words are considered unknown (UNK) and given the embedding index 1.
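A minimal sketch of the tokenization and padding step with the Keras tokenizer is given below; the vocabulary size and maximum sequence length are the values reported in the next subsection, and the out-of-vocabulary token is mapped to index 1, matching the behaviour described above. The variables `train_texts` and `test_texts` are hypothetical placeholders for the pre-processed article strings.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 200_000  # number of unique words kept in the dictionary
MAX_LEN = 6220        # maximum number of words per article

# Placeholder lists; in practice these come from the Text column of the CSV files above.
train_texts = ["..."]
test_texts = ["..."]

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<UNK>")  # unknown words map to index 1
tokenizer.fit_on_texts(train_texts)

X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=MAX_LEN)
```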
Proposed model
The proposed model comprises two basic deep learning architectures: the Inception module and LSTM. Figure 2 illustrates the proposed model architecture, showing all model layers in addition to a block that contains the Inception module. The proposed model consists of the following layers (a Keras sketch of this architecture follows the list):
Embedding layer, which takes the following parameters:
Vocabulary size (number of unique words): 200,000
Embedding dimension (dimension of the vector to represent a single word): 64
Maximum sequence length (maximum number of words in an inputted article): 6220
Inception module, which consists of four parallel branches whose outputs are concatenated:
1D convolutional layer with filter size = 1.
1D convolutional layer with filter size = 1, followed by another one with filter size = 3.
1D convolutional layer with filter size = 1, followed by another one with filter size = 5.
1D Max pooling layer of size = 3 and stride = 3, followed by a 1D convolutional layer with filter size = 1.
1D max pooling layer of size = 3 and stride = 3. The main purpose of this layer is to decrease the input size to one third before entering the LSTM layer.
Dropout layer of 0.5
LSTM layer
Flattening layer, to prepare the input for the dense layer
Another dropout layer of 0.5
Dense layer, to output the predicted class.
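The sketch below assembles the listed layers with the Keras functional API. The number of filters per Inception branch and the number of LSTM units are not specified in the paper and are assumptions here, as is the use of "same" padding and stride 1 inside the Inception branches so that the four branch outputs can be concatenated.

```python
from tensorflow.keras import layers, regularizers, Model

VOCAB_SIZE = 200_000   # vocabulary size
EMBED_DIM = 64         # embedding dimension
MAX_LEN = 6220         # maximum sequence length
NUM_CLASSES = 7
L2 = regularizers.l2(0.0005)

def conv1d(x, filters, kernel_size):
    # 1D convolution with ReLU and L2 regularization (see the Hyperparameters section).
    return layers.Conv1D(filters, kernel_size, padding="same",
                         activation="relu", kernel_regularizer=L2)(x)

def inception_block(x, filters=64):
    # Four parallel branches whose outputs are concatenated on the channel axis.
    b1 = conv1d(x, filters, 1)
    b2 = conv1d(conv1d(x, filters, 1), filters, 3)
    b3 = conv1d(conv1d(x, filters, 1), filters, 5)
    b4 = conv1d(layers.MaxPooling1D(3, strides=1, padding="same")(x), filters, 1)
    return layers.Concatenate()([b1, b2, b3, b4])

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, embeddings_regularizer=L2)(inputs)
x = inception_block(x)
x = layers.MaxPooling1D(pool_size=3, strides=3)(x)  # reduce the sequence length to one third
x = layers.Dropout(0.5)(x)
x = layers.LSTM(64, return_sequences=True, kernel_regularizer=L2)(x)  # unit count assumed
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax",
                       kernel_regularizer=regularizers.l2(0.025))(x)

proposed_model = Model(inputs, outputs)
```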
Fig. 2 [Images not available. See PDF.]
Proposed model architecture
Model evaluation
To evaluate our models, we have used the standard evaluation metrics usually used to assess the performance of classification models [31]. These metrics are derived from the confusion matrix, which is composed of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). They include classification accuracy, precision, recall, and F1-score (Eqs. 1-4).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$
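Assuming the model's predictions and the true labels are available as integer class indices, the overall accuracy and the per-class precision, recall, and F1-scores reported later can be computed with scikit-learn, as in this brief sketch (the function name is illustrative):

```python
from sklearn.metrics import accuracy_score, classification_report

CLASS_NAMES = ["Culture", "Finance", "Medical", "Politics", "Religion", "Sports", "Tech"]

def evaluate(y_true, y_pred):
    """Print the overall accuracy plus per-class precision, recall, and F1-score."""
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, target_names=CLASS_NAMES))
```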
Model comparison
Five baseline models have been selected to train on the same datasets so that their results can be compared with the proposed model. These models are listed below (a Keras sketch of the CNN-75-LSTM baseline follows the list):
CNN-5: consists of a 1D convolution layer with filter size=5, followed by a 1D GlobalMaxPooling layer.
CNN-55: consists of two 1D convolution layers with filter size = 5, followed by a 1D GlobalMaxPooling layer.
CNN-75: consists of two 1D convolution layers with filter sizes 7 and 5, followed by a 1D GlobalMaxPooling layer.
CNN-75-LSTM: consists of two 1D convolution layers with filter sizes 7 and 5, followed by an LSTM layer and then a 1D GlobalMaxPooling layer.
IM: The state-of-the-art Inception Module, but with 1D conv instead of 2D as originally created.
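As an illustration of how these baselines are structured, the CNN-75-LSTM variant could be expressed in Keras roughly as below; only the kernel sizes are given in the paper, so the filter counts and LSTM units are assumptions.

```python
from tensorflow.keras import layers, models

def build_cnn75_lstm(vocab_size=200_000, embed_dim=64, max_len=6220, num_classes=7):
    """CNN-75-LSTM baseline: Conv1D (kernel 7) -> Conv1D (kernel 5) -> LSTM -> GlobalMaxPooling."""
    return models.Sequential([
        layers.Embedding(vocab_size, embed_dim, input_length=max_len),
        layers.Conv1D(64, 7, activation="relu"),
        layers.Conv1D(64, 5, activation="relu"),
        layers.LSTM(64, return_sequences=True),  # keep the sequence so max pooling can follow
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_classes, activation="softmax"),
    ])
```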
Experiments
Six experiments have been applied to each dataset, including the proposed model and the five models aforementioned in the previous section.
Work environment
Google Colaboratory (Colab) has been used as the main platform [32], with Python 3 as the programming language. The TensorFlow (2.7.0) and Keras libraries have been used for the deep learning models [33]. Colab provided an NVIDIA GPU with 16 GB of memory and high-speed RAM (27 GB).
Hyperparameters
The following are the main hyperparameters that have been used in our experiments (a compile-and-train sketch follows the list):
Activation function of Inception module convolution layers: ReLU
L2 regularizer of 0.0005 was used in the embedding layer, all Inception Module convolution layers, and the LSTM layer.
L2 regularizer of 0.025 in the Dense layer
Activation function used in the Dense layer: Softmax
The optimizer: NAdam
The loss function: CategoricalCrossEntropy
Batch size: 128
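A minimal compile-and-fit sketch with these hyperparameters is shown below; `proposed_model`, `X_train`, and `y_train` refer to objects built in the earlier sketches, the number of epochs varies per experiment (see Table 10), and `y_train` is assumed to be one-hot encoded to match the categorical cross-entropy loss.

```python
import tensorflow as tf

proposed_model.compile(
    optimizer=tf.keras.optimizers.Nadam(),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)

history = proposed_model.fit(
    X_train, y_train,
    validation_split=0.1,  # the 90/10 training/validation split described in the Methodology
    epochs=10,             # the proposed model needed only 10 epochs (Table 10)
    batch_size=128,
)
```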
Results and discussion
As stated earlier in the methodology section, six experiments have been applied to each of the Arabic text classification datasets: SANAD, including its sub-datasets (Akhbarona, AlKhaleej, and AlArabiya), and NADiA. Table 10 describes a sample of the experiments applied to one of the datasets, Akhbarona. Its primary purpose is to compare the execution time and loss scores for each model.
Table 10. Experiments description in terms of #epochs, training time and loss (Akhbarona dataset)
Model | #Epochs | Execution time (minutes) | Loss | Validation loss |
|---|---|---|---|---|
CNN-5 | 100 | 67 | 0.6833 | 0.6728 |
CNN-55 | 100 | 98.7 | 0.3864 | 0.6146 |
CNN-75 | 100 | 98.4 | 0.37 | 0.5795 |
CNN75-LSTM | 25 | 34.9 | 0.6972 | 0.8297 |
IM | 100 | 99.9 | 0.4982 | 0.5971 |
Proposed Model | 10 | 263.5 | 0.5404 | 0.6662 |
The accuracy of each model has been calculated and compared with the proposed model. Table 11 and Fig. 3 illustrate these comparisons. Since balanced datasets have been used in the experiments, accuracy tends to be a reasonable measure. Nevertheless, for additional evaluation, precision, recall, and F1-score have been calculated for each class in all models; the results are displayed in Tables 12, 13, 14, 15 and 16.
Table 11. Accuracy Comparison between the proposed model and other models on both datasets
Model | Description | Akhbarona | AlKhaleej | AlArabiya | SANAD | NADiA |
|---|---|---|---|---|---|---|
CNN-5 | CNN (conv5) | 0.91 | 0.94 | 0.51 | 0.89 | 0.87 |
CNN-55 | CNN (conv5 + conv5) | 0.91 | 0.96 | 0.55 | 0.93 | 0.89 |
CNN-75 | CNN (conv7 + conv5) | 0.92 | 0.95 | 0.54 | 0.92 | 0.89 |
CNN75-LSTM | CNN (conv7 + conv5) + LSTM + GlobalMaxPooling | 0.89 | 0.96 | 0.19 | 0.92 | 0.88 |
IM | Inception Module | 0.9 | 0.95 | 0.51 | 0.91 | 0.88 |
Proposed Model | Inception Module + MaxPooling (size3, stride3) + LSTM | 0.92 | 0.96 | 0.82 | 0.92 | 0.89 |
Fig. 3 [Images not available. See PDF.]
Proposed model accuracy evaluation for all datasets
Referring to Table 11, it can be noticed that the accuracy of the experiments differs according to the dataset used. Consider that AlKhaleej is a balanced dataset, while the others are not. Although the training datasets have been manually balanced, the remaining articles in the testing datasets are not. Therefore, the accuracy measure may make more sense for the AlKhaleej dataset.
It has also been noticed that stacking two convolution layers with the same filter size is, in most cases, more effective at increasing accuracy than using a single layer of that size. On the other hand, using several convolution layers with different filter sizes did not have this effect on the accuracy for most datasets.
To confirm whether using multiple and different filters enhances the model, the recall, precision, and f1-score have been calculated for both datasets. Since most of the datasets are unbalanced in nature, these metrics may be more informative.
The CNN75-LSTM model was an attempt to enhance the model by taking advantage of the characteristics of the LSTM, which maintains the necessary relationships between words in the articles; thus, it was expected to perform better. Unfortunately, it only improved the results on the balanced dataset (AlKhaleej), while for the others it reduced the accuracy. The f1-scores for Akhbarona also show that this model falls short.
The Inception module (IM) has been applied to all datasets without the added LSTM layer. Using the IM alone has yielded high accuracy scores of 90%, 95%, and 91% for Akhbarona, AlKhaleej, and SANAD, respectively, which encouraged combining it with the LSTM.
In the proposed model, the LSTM, along with a MaxPooling layer of size 3 and stride 3, has increased the accuracy by 2% over the IM for Akhbarona and by about 1% for AlKhaleej, SANAD, and NADiA. A major enhancement has been noticed for the AlArabiya dataset.
Also, when using MaxPooling with size 3 and stride 3 before the LSTM layer, the input size is decreased to one third, while the output still holds the best features. These features are fed into the LSTM layer, where the connections between them are learned.
Another benefit of decreasing the input size to one third before the LSTM is that the execution time drops, resulting in higher model efficiency. This is a contribution of the proposed model over the one proposed by [9].
Finally, it can be noticed that CNN-75 and the proposed model have yielded the same accuracy of 92% for the Akhbarona dataset, while the proposed model improved the accuracy for AlKhaleej by about 1%. Similar results can be noticed for SANAD and NADiA. Nevertheless, in all cases, CNN-75 needed 100 epochs to reach this accuracy, whereas the proposed model needed only 10 epochs. Thus, the proposed model has outperformed the other compared models consisting of CNN, LSTM, or both, with slightly different margins depending on the nature of the dataset used.
Only some of the models presented in the related work have been included in comparing the proposed model with the literature: those that worked on Arabic datasets and used CNN, LSTM, or both. Since most of these works have evaluated their models using the accuracy measure, it has been decided to use accuracy for comparison purposes, as illustrated in Table 17.
When comparing the results, it can be noticed that the proposed model, especially when using the AlKhaleej dataset, has yielded the best accuracy over the other models, except for the one proposed by [14]. Nevertheless, they have explained that their accuracy of 96.94% was gained when applying their model on the SANAD dataset, i.e., the three sub-datasets described earlier in Table 3. Since these sub-datasets have different numbers of classes, the authors selected only the five mutual classes among them to train their model on. In our experiments, when training the models on SANAD, all seven classes were kept, with the variance of counts in each class, and a high accuracy of 92% was still yielded. Thus, their slightly higher accuracy was gained by training on a dataset with fewer classes than used in our experiments; yet, the proposed model yielded an accuracy of 96% when trained on AlKhaleej. As for the NADiA dataset, our model has outperformed the results of [14].
Limitations and future work
The main limitation of our work is that it concentrated on classifying Arabic news articles using two large corpora: SANAD and NADiA. Our model still needs to be validated for generalization by testing its performance in classifying other types of Arabic text. For future work, several corpora will be used to evaluate our proposed model, such as:
The OSACT5 corpus for Arabic hate speech detection [34].
The ArSenTD-Lev corpus for sentiment analysis of Arabic tweets in the Levantine dialect [35].
AraSenCorpus, a large Arabic text corpus for sentiment annotation [36].
Conclusion
Text classification in Arabic is a significant task in Natural Language Processing (NLP), and recent studies have demonstrated improved results by employing deep learning algorithms. This paper introduces a novel hybrid framework called ATC-IM-LSTM, which combines the Inception-CNN module with Long Short-Term Memory (LSTM) for Arabic text classification. In this hybrid model, integrating the Inception module and LSTM has enabled the capture of multi-scale features and the effective learning of local and global patterns (due to the Inception module), as well as the modeling of sequential and long-term dependencies in textual data (due to the LSTM module). To evaluate the effectiveness of our proposed framework, we conducted experiments on two well-known Arabic news datasets, viz., SANAD (including its sub-datasets) and NADiA. The experimental evaluation demonstrated the superiority of ATC-IM-LSTM, which outperformed other state-of-the-art models in accurately classifying Arabic texts. The findings of this study contribute to the growing body of research on Arabic text classification and provide insights into the effectiveness of deep learning techniques for this task.
Author contributions
All authors, Eman Alnagi, Rawan Ghnemat and Qasem Abu Al-Haija have contributed equally in the paper, and they all have read and agreed to the published version of the manuscript.
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Data availability
Datasets used for model evaluation are publicly available on the following links: SANAD: https://data.mendeley.com/datasets/57zpx667y9/1 NADiA: https://data.mendeley.com/datasets/hhrb7phdyx/1
Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare no Conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Jurafsky D, Martin JH. Speech and language processing (3rd draft ed.) 2019.
2. Zhang J, Li Y, Tian J, Li T. Lstm-cnn hybrid model for text classification, (IEEE) 2018;1675–1680.
3. Zhang J, Li Y, Tian J, Li T. Lstm-cnn hybrid model for text classification, (IEEE) 2018;1675–1680.
4. Li C, Zhan G, Li Z. News text classification based on improved bi-lstm-cnn, (IEEE) 2018;890–893.
5. Luan Y, Lin S. Research on text classification based on cnn and lstm, (IEEE) 2019;352–355.
6. Sachan DS, Zaheer M, Salakhutdinov R. Revisiting lstm networks for semi-supervised text classification via mixed objective function. 2019;33:6940–8.
7. Liu, G; Guo, J. Bidirectional lstm with attention mechanism and convolutional layer for text classification. Neurocomputing; 2019; 337, pp. 325-338. [DOI: https://dx.doi.org/10.1016/j.neucom.2019.01.078]
8. Jang, B; Kim, M; Harerimana, G; Kang, S-U; Kim, JW. Bi-lstm model to increase accuracy in text classification: combining word2vec cnn and attention mechanism. Appl Sci; 2020; 10, 5841. [DOI: https://dx.doi.org/10.3390/app10175841]
9. Jiang W, Jin Z. Integrating bidirectional lstm with inception for text classification, (IEEE) 2017;870–875.
10. Fields, J; Chovanec, K; Madiraju, P. A survey of text classification with transformers: how wide? how large? how long? how accurate? how expensive? how safe?. IEEE Access; 2024; 12, pp. 6518-6531. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3349952]
11. Shaalan K, Siddiqui S, Alkhatib M, Abdel Monem A. in Challenges in Arabic natural language processing (World Scientific) 2019;59–83.
12. Boukil, S; Biniz, M; El Adnani, F; Cherrat, L; El Moutaouakkil, AE. Arabic text classification using deep learning technics. Int J Grid Distrib Comput; 2018; 11, pp. 103-114. [DOI: https://dx.doi.org/10.14257/ijgdc.2018.11.9.09]
13. Elnagar, A; Lulu, L; Einea, O. An annotated huge dataset for standard and colloquial Arabic reviews for subjective sentiment analysis. Procedia Comput Sci; 2018; 142, pp. 182-189. [DOI: https://dx.doi.org/10.1016/j.procs.2018.10.474]
14. Elnagar, A; Al-Debsi, R; Einea, O. Arabic text classification using deep learning models. Inf Proc Manag; 2020; 57, [DOI: https://dx.doi.org/10.1016/j.ipm.2019.102121] 102121.
15. Al-Smadi, M; Talafha, B; Al-Ayyoub, M; Jararweh, Y. Using long short-term memory deep neural networks for aspect-based sentiment analysis of Arabic reviews. Int J Mach Learn Cybern; 2019; 10, pp. 2163-2175. [DOI: https://dx.doi.org/10.1007/s13042-018-0799-4]
16. Alwehaibi A, Roy K. Comparison of pre-trained word vectors for arabic text classification using deep learning approach, (IEEE) 2018;1471–1474 .
17. Elnagar A, Einea O, Al-Debsi R. Automatic text tagging of arabic news articles using ensemble deep learning models, 2019;59–66.
18. Elayeb B. Arabic text classification: a literature review, (IEEE) 2021;1–8.
19. Alom, MZ et al. A state-of-the-art survey on deep learning theory and architectures. Electronics; 2019; 8, 292. [DOI: https://dx.doi.org/10.3390/electronics8030292]
20. Rani, S et al. An efficient cnn-lstm model for sentiment detection in# blacklivesmatter. Expert Syst Appl; 2022; 193, [DOI: https://dx.doi.org/10.1016/j.eswa.2021.116256] 116256.
21. Pimpalkar, A et al. Mbilstmglove: embedding glove knowledge into the corpus using multi-layer bilstm deep learning model for social media sentiment analysis. Expert Syst Appl; 2022; 203, [DOI: https://dx.doi.org/10.1016/j.eswa.2022.117581] 117581.
22. Szegedy C, et al. Going deeper with convolutions, 2015;1–9.
23. Lin M, Chen Q, Yan S. Network in network. arXiv preprint arXiv:1312.4400 2013.
24. Wahdan, KA; Hantoobi, S; Salloum, SA; Shaalan, K. A systematic review of text classification research based ondeep learning models in arabic language. Int J Electr Comput Eng; 2020; 10, pp. 6629-6643.
25. Khaled MM, Al-Barham M, Alomari OA, Elnagar A. Arabic news articles classification using different word embeddings, (Springer) 2023;125–136.
26. Alfartosy HH, Khafaji HK. A hybrid technique for arabic text classification using semi-supervised learning, Vol. 3097 (AIP Publishing) 2024.
27. Alruily, M; Manaf Fazal, A; Mostafa, AM; Ezz, M. Automated arabic long-tweet classification using transfer learning with bert. Appl Sci; 2023; 13, 3482. [DOI: https://dx.doi.org/10.3390/app13063482]
28. El Kah, A; Zeroual, I. The effects of pre-processing techniques on arabic text classification. Int J; 2021; 10, pp. 1-12.
29. Wang, S; Zhou, W; Jiang, C. A survey of word embeddings based on deep learning. Computing; 2020; 102, pp. 717-740.
30. Alyafeai Z, Al-shaibani MS, Ghaleb M, Ahmad I. Evaluating various tokenizers for arabic text classification. Neural Proc Lett. 2022;1–23.
31. Vujović, Z. Classification model evaluation metrics. Int J Adv Comput Sci Appl; 2021; 12, pp. 599-606.
32. Carneiro, T et al. Performance analysis of google colaboratory as a tool for accelerating deep learning applications. IEEE Access; 2018; 6, pp. 61677-61685. [DOI: https://dx.doi.org/10.1109/ACCESS.2018.2874767]
33. Géron A. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems (O’Reilly Media) 2019.
34. Mubarak H, Al-Khalifa H, Al-Thubaity A. Overview of osact5 shared task on arabic offensive language and hate speech detection, 2022;162–166.
35. Baly R, Khaddaj A, Hajj H, El-Hajj W, Shaban KB. Arsentd-lev: A multi-topic corpus for target-based sentiment analysis in arabic levantine tweets. arXiv preprint arXiv:1906.01830. 2019.
36. Al-Laith, A; Shahbaz, M; Alaskar, HF; Rehmat, A. Arasencorpus: a semi-supervised approach for sentiment annotation of a large arabic text corpus. Appl Sci; 2021; 11, 2434. [DOI: https://dx.doi.org/10.3390/app11052434]
37. Sun X, Lu W. Understanding attention for text classification, 2020;3418–3428.