Text Classification: How Machine Learning Is

1. Introduction

The history of text categorization (TC) is a narrative of continuous evolution, driven by the growing need to efficiently manage and organize ever-increasing volumes of text data. Initially a manual process rooted in text and corpus linguistics, TC involved categorizing texts into predefined topics or genres [1,2,3]. However, the digital revolution and exponential growth of textual data rendered manual methods impractical, necessitating the development of automated systems. Early approaches relied on handcrafted features and rule-based systems, which, while foundational, were limited by their rigidity and inability to adapt to new data. Subsequent advancements introduced statistical techniques, nature-inspired algorithms, and graph-based methods to enhance the flexibility and accuracy of text categorization [4].

The introduction of machine learning (ML) marked a turning point, with algorithms like k-nearest neighbors (KNN) and support vector machines (SVMs) offering improved scalability and accuracy by learning directly from data. These methods utilize feature selection techniques to address challenges such as high-dimensional feature spaces and scalability. This shift represented a significant improvement in classification performance and adaptability [1,5,6]. More recently, deep learning has revolutionized TC, enabling the development of models capable of capturing intricate semantic relationships in text. Techniques like recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based models (e.g., BERT) have dramatically improved performance. Additionally, semantic methods such as ontology-based classification and latent semantic indexing have enhanced contextual understanding of text data [7].

Despite its progress, TC remains a field at the crossroads of ML and information retrieval (IR), sharing features with related areas like text mining and knowledge extraction. This overlap has led to fragmented literature, inconsistent terminology, and a lack of standardized frameworks [1,2,3]. Challenges include ambiguous definitions of terms like “automatic text classification”, which variously refers to assigning predefined categories, creating new categories, or clustering texts [8,9]. Furthermore, the field lacks comprehensive resources such as dedicated textbooks or journals, hindering the consolidation of knowledge and impeding newcomers [10].

However, these gaps present opportunities for advancement. By developing systematic methodologies, standardizing terminologies, and centralizing resources, researchers can unify the field and enhance its applicability. The absence of structured guidance also underscores the potential for innovative contributions, such as creating frameworks that bridge theory and practice or addressing evolving challenges like multilingual classification, noisy data handling, and explainability in models.

This research paper presents an extensive survey of text classification and machine learning, offering a unified framework that consolidates best practices from ML, natural language processing (NLP), and information retrieval (IR). It introduces a comprehensive taxonomy of text classification techniques, encompassing traditional algorithms, modern ML approaches, and emerging trends in deep learning. Furthermore, the paper provides a detailed evaluation of methods using standardized datasets and metrics, making it a foundational resource for researchers and a practical guide for industry professionals.

Building on these advancements, recent years have ushered in transformative trends that further define the state of the art in text categorization. Post-2023, the field has seen significant progress in the development of fine-tuned transformer architectures and advanced pretraining techniques, enabling models to excel in few-shot and zero-shot classification scenarios. These approaches address the persistent challenge of limited labeled data by leveraging vast unlabeled corpora and contextual knowledge encoded during pretraining.

Furthermore, the rise of domain-specific language models has reshaped applications of text categorization in specialized industries such as healthcare, legal systems, and e-commerce. These tailored models enhance performance by integrating domain-relevant semantics and terminology, enabling more precise and context-aware classification. Another notable trend is the optimization of lightweight and efficient transformer models designed for deployment in resource-constrained environments, such as mobile devices and IoT platforms. These advancements are critical as the demand for on-device text categorization continues to grow, particularly in applications like real-time content filtering and personalized recommendations.

The field has also witnessed renewed emphasis on multilingual and cross-lingual text classification techniques. Innovations in transfer learning and adaptive fine-tuning have enabled models to process diverse languages within unified frameworks, making significant strides toward global accessibility and inclusivity. In parallel, researchers are addressing persistent challenges such as model interpretability, bias mitigation, and handling noisy or imbalanced datasets. Ethical considerations have gained prominence, with a focus on ensuring fairness, transparency, and accountability in deploying TC systems for high-impact applications like automated moderation and misinformation detection. These recent advancements and ongoing efforts underscore the dynamic nature of text categorization, highlighting its expanding relevance across disciplines and industries. By consolidating these trends and providing a robust evaluation of emerging techniques, this paper aims to bridge the gap between foundational knowledge and cutting-edge research, offering a resource that is both comprehensive and forward-looking.

The paper is divided into eleven sections to help readers navigate this comprehensive survey. Section 1 provides an overview of the historical and technological backdrop for text categorization (TC), as well as its problems and objectives. Section 2 examines the scope and role of TC, distinguishing it from comparable tasks and summarizing previous research. Section 3 describes the study approach, whereas Section 4 and Section 5 cover significant applications and machine learning techniques in TC. Section 6 of the study delves into foundational and advanced document representation strategies, followed by Section 7’s exploration of evaluation measures. Section 8 addresses TC challenges, while Section 9 discusses recent breakthroughs in deep learning. Finally, Section 10 discusses future directions, and Section 11 provides a summary of the study’s contributions.

2. Background

Section 2 reviews text categorization (TC), distinguishing it from related topics and describing recent advances in the field. It examines the fundamental principles of TC, highlights noteworthy studies from 2019 to 2024, and identifies key issues and developing trends. By breaking the subject into logical subsections, the section aims to provide a unified narrative that connects the theoretical and practical aspects of TC.

Overview of Text Categorization (TC) and Recent Research

Text categorization (TC), also known as text classification, is a core task in text mining and natural language processing (NLP). Text categorization plays an integral part in managing and organizing unstructured data, employing machine learning techniques to assign predefined categories to text documents. This process enables efficient information retrieval and analysis, with applications in sentiment analysis, spam detection, and topic classification. By streamlining activities such as content filtering and subject identification, TC enhances productivity and supports decision making.

Historically, TC was performed manually, which was suitable for small datasets but lacked scalability, consistency, and speed, especially with dynamic data like social media. The advent of automated TC, driven by machine learning (ML) and NLP, revolutionized the field by increasing speed, accuracy, and scalability while reducing human bias. Automated TC is now widely adopted in industries that process extensive and rapidly growing data streams [10].

The domain of text categorization has experienced remarkable advancements, propelled by continuous innovations in machine learning (ML) and natural language processing (NLP). Over the years, researchers have developed sophisticated techniques to enhance classification accuracy, scalability, and adaptability across diverse datasets and applications. These developments span traditional machine learning approaches, deep learning architectures, and hybrid models that blend the strengths of both paradigms. This evolution has not only improved the precision of text categorization systems but also expanded their relevance to areas such as sentiment analysis, spam detection, topic identification, and domain-specific classification.

One of the most impactful trends in recent research is the integration of transfer learning techniques, which allow models to leverage knowledge from pretrained language representations. This approach has significantly boosted the effectiveness of text categorization, particularly in handling low-resource languages and niche domains. Additionally, studies have explored hybrid methodologies that combine rule-based systems with advanced machine learning techniques to address challenges in multilingual and domain-specific contexts. These approaches underscore the growing importance of adaptability and context-awareness in modern text categorization systems.

Recent studies in 2024 have placed particular emphasis on leveraging pre-trained language models, domain-specific adaptations, and innovative clustering techniques. These advancements have proven especially effective in managing complex, multidimensional datasets, enabling more nuanced and accurate classifications. Table 1 summarizes key contributions to text categorization from studies published between 2019 and 2024, providing an overview of cutting-edge methodologies and findings.

3. Method

3.1. Search Strategy and Databases

To comprehensively cover the body of literature on text classification, we utilized multiple academic databases, including PubMed, Web of Science, IEEE Xplore, Scopus, Google Scholar, ACM Digital Library, ScienceDirect, JSTOR, ProQuest, SpringerLink, and EBSCOhost. These databases were chosen for their extensive coverage of scientific and scholarly publications in technology, computer science, and machine learning.

Our search strategy was systematic, combining relevant keywords and Boolean operators to ensure a comprehensive collection of articles. Keywords included “text classification”, “machine learning in text analysis”, “document categorization”, “natural language processing”, “text mining”, “feature selection for classification”, and “supervised learning for text”. The search string was refined iteratively, guided by recent reviews on text classification and related fields (e.g., [29]). This rigorous approach ensured the retrieval of relevant and high-quality research articles for our analysis.

3.2. Inclusion and Exclusion Criteria

To streamline the review process and ensure the quality and relevance of the selected studies, we established explicit inclusion and exclusion criteria.

Inclusion Criteria:

Published peer-reviewed articles focusing on text classification using machine learning techniques.
Studies presenting experimental results on various classification algorithms.
Articles published in English.
Papers discussing challenges and advancements in text classification, including preprocessing, feature selection, and evaluation metrics.
Conference proceedings, book chapters, and review articles relevant to text classification.

Exclusion Criteria:

Articles only tangentially related to text classification or focusing on unrelated machine learning domains.
Papers lacking experimental results or substantive analysis.
Secondary sources not published in English.
Studies addressing text classification superficially without detailed methodological or algorithmic discussion.

3.3. Data Extraction and Analysis

Once the final selection of articles was made based on the inclusion and exclusion criteria, data extraction focused on critical aspects of the studies. Extracted data included authors, publication year, study design, classification techniques used, datasets employed, preprocessing methods, feature selection approaches, performance metrics, primary findings, and conclusions.

The data analysis followed a narrative synthesis approach to accommodate the diversity of studies [30]. Descriptive analysis highlighted bibliometric characteristics such as the number of studies; publication trends over time; countries of origin; and frequently used datasets.

Thematic analysis categorized findings into recurring themes, such as:
Preprocessing and feature engineering techniques in text classification.
Evaluation of machine learning algorithms for classification tasks.
Performance metrics and benchmarks in text classification.
Challenges in real-world applications, such as scalability and bias.
Emerging trends and innovations, including deep learning and transformer-based models.

This systematic approach ensured that the study rigorously examined the state of research on text classification, offering insights into current advancements, limitations, and future directions.

4. Approaches to Text Categorization

TC employs a variety of machine learning practices, broadly categorized into supervised, unsupervised, and deep learning methods:

Supervised Learning

In supervised learning, models are trained using labeled datasets to classify new documents into predefined categories. Popular algorithms for this purpose include logistic regression, naive Bayes, random forest, support vector machines (SVMs), and AdaBoost. For example, naive Bayes has demonstrated impressive accuracy, reaching up to 96.86% in certain applications [23].
Deep learning techniques, such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, achieve high accuracy while requiring minimal feature engineering. For instance, LSTMs have demonstrated accuracy rates of up to 92% in specific tasks [31].

Unsupervised Learning

Unsupervised learning techniques, including hierarchical clustering, k-means clustering, and probabilistic clustering, are used to group documents based on content similarity in cases where labeled data are not available [32].
These techniques uncover inherent data structures and are instrumental in analyzing unlabeled datasets [4].

Advancements in TC leverage feature extraction and dimensionality reduction techniques like PCA and LDA, enhancing model performance. Deep learning models further refine TC by capturing linguistic subtleties such as tone and context, making them invaluable for tasks like sentiment analysis. These developments position TC as a vital tool across industries for deriving insights from text data and managing digital information environments [31].

Despite its benefits, TC faces challenges such as handling ambiguous or overlapping categories and requiring large labeled datasets for supervised learning. Algorithm selection also influences outcomes, with models like naive Bayes and SVM performing differently across datasets and applications [33].

4.1. The Rise of Machine Learning in TC

Machine learning (ML) has significantly advanced TC by replacing rule-based systems with adaptive algorithms that learn from labeled data. Early ML models like naive Bayes and SVM laid the groundwork for TC.

Naive Bayes: Effective for large vocabularies due to its probabilistic approach.
SVM: Maps text into a high-dimensional feature space and separates classes with a maximum-margin boundary, which helps distinguish closely related themes.

The emergence of deep learning further transformed TC with models like CNNs and RNNs, capable of capturing local word dependencies and long-term correlations. Transformer models, such as BERT, have set new benchmarks by understanding bidirectional, long-distance interactions in text. These innovations enable tasks like sarcasm and emotion detection with minimal fine-tuning. Additionally, metrics like burstiness and perplexity enhance TC by identifying significant phrases and quantifying prediction uncertainty.

Applications of ML-driven TC span customer service, healthcare, finance, and more, enabling rapid and precise classification. This supports innovations in content moderation, personalized recommendations, and trend analysis [1].

4.2. Benefits of Automated TC over Manual Classification

Automated TC offers numerous advantages over manual processes:

Scalability and Efficiency: Handles large datasets rapidly and consistently, unlike manual methods that are time-intensive and impractical for extensive collections.
Objectivity: Applies standard criteria uniformly, eliminating human bias and ensuring reliable outcomes, crucial for domains like legal document classification.
Real-Time Processing: Facilitates immediate classification, essential in industries like finance and journalism where timely decisions are critical [34].

Metrics such as burstiness and perplexity help TC systems cope with dynamic settings. Burstiness measures fluctuation in word occurrence, allowing for improved detection of significant terms, whereas perplexity quantifies uncertainty in text predictions, which helps models adapt to changing datasets. These metrics improve model performance in complex, dynamic situations [35].
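As a rough illustration of the two metrics named above, the sketch below computes a unigram perplexity and a simple burstiness score. The text does not fix formulas, so both definitions are assumptions: perplexity is taken as the exponential of the average negative log-probability under an add-one-smoothed unigram model, and burstiness is taken as the variance-to-mean ratio of a term's per-document counts. The function names are hypothetical.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    counts = Counter(train_tokens)
    # The vocabulary includes test tokens so unseen words get non-zero probability.
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)  # add-one smoothing denominator
    log_prob = sum(math.log((counts[t] + 1) / total) for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

def burstiness(term, documents):
    """Variance-to-mean ratio of a term's counts across tokenized documents
    (an assumed definition; other burstiness measures exist)."""
    counts = [doc.count(term) for doc in documents]
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance / mean
```

Under these assumed definitions, a term spread evenly across documents scores 0, while a term concentrated in a few documents scores higher; lower perplexity means the model finds the text less surprising.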

4.3. Types of TC Tasks

Text categorization assigns predefined categories to free-text documents, organizing them conceptually for efficient retrieval and management [5]. Applications include email filtering, topic labeling, and content organization for digital libraries.

TC tasks vary depending on the nature of the classification problem. Common types include:

Binary Classification: This involves two classes, such as spam and non-spam emails, where each document belongs to one of the two categories [5].
Multiclass Classification: More than two classes are involved, and each document is assigned to only one class, such as classifying news articles into topics like politics, sports, or entertainment [5].
Single-Label Classification: Often approached using binary classification methods, where documents are classified into distinct categories without overlap [36].
Multilabel Classification: In this case, each document may belong to multiple categories simultaneously. For example, an academic paper may be categorized under multiple disciplines like biology and technology [5].
Hierarchical Classification: Documents are classified into categories organized in a hierarchy. This type is beneficial for large datasets with numerous categories [5].
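The task types listed above differ mainly in the shape of the label attached to each document. The following minimal sketch shows one possible encoding of each; the category names and the helper to_indicator are hypothetical and chosen only for illustration.

```python
# Hypothetical category inventory used only for illustration.
CATEGORIES = ["biology", "technology", "politics"]

def to_indicator(label_sets, categories=CATEGORIES):
    """Encode each document's label set as a 0/1 vector, one slot per category."""
    return [[int(c in labels) for c in categories] for labels in label_sets]

# One possible label shape per task type:
binary_label = "spam"                                         # one of two classes
multiclass_label = "politics"                                 # exactly one of many classes
multilabel_labels = {"biology", "technology"}                 # any subset of the classes
hierarchical_label = ("Technology", "Artificial Intelligence")  # path from root to node
```

The indicator encoding is a common way to hand multilabel targets to a classifier, since each column can then be treated as its own binary decision.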

4.4. Document-Pivoted vs. Category-Pivoted TC

Document-pivoted and category-pivoted text categorization represent two methodologies for organizing the classification process.

(a). Document-Pivoted Categorization (DPC): This approach focuses on classifying a document by searching across all possible categories. It is generally simpler to implement and more efficient for practical applications [37].
(b). Category-Pivoted Categorization (CPC): In contrast, CPC starts from a category and identifies all the documents that belong to it. This method is more complex, as it requires re-evaluating document classifications when new categories are added.
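The difference between the two pivots can be sketched as two ways of reading the same document-by-category score table. The scores, the 0.5 threshold, and the function names below are hypothetical; in practice the scores would come from a trained classifier.

```python
# Hypothetical relevance scores: scores[document][category], higher means more relevant.
scores = {
    "doc1": {"sports": 0.9, "politics": 0.2},
    "doc2": {"sports": 0.1, "politics": 0.7},
    "doc3": {"sports": 0.6, "politics": 0.8},
}

def document_pivoted(scores, doc, threshold=0.5):
    """Given one document, return every category it belongs to."""
    return sorted(c for c, s in scores[doc].items() if s >= threshold)

def category_pivoted(scores, category, threshold=0.5):
    """Given one category, return every document that belongs to it."""
    return sorted(d for d, row in scores.items() if row[category] >= threshold)
```

Document-pivoted classification reads one row at a time, so it suits documents that arrive one by one; category-pivoted classification reads one column, so it must revisit the whole collection whenever a category is added.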

4.5. Hard Categorization vs. Ranking

Hard categorization and ranking are two distinct approaches to classifying documents.

(a). Hard Categorization: This method assigns each document to a single category, making a binary decision about whether the text belongs to a given category or not.
(b). Ranking: In contrast, ranking categorization involves generating a list of categories ranked by their relevance to the document. This approach provides a more nuanced view of a document’s classification, allowing for further decision-making processes based on category probability. Figure 1 and Figure 2 represent hard categorization and ranking categorization.
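A minimal sketch of the two output styles, assuming a classifier has already produced one relevance score per category for a document (the function names and scores are hypothetical):

```python
def hard_categorize(category_scores):
    """Return the single best-scoring category (a hard decision)."""
    return max(category_scores, key=category_scores.get)

def rank_categories(category_scores):
    """Return all categories ordered from most to least relevant."""
    return sorted(category_scores, key=category_scores.get, reverse=True)
```

For example, with scores {"sports": 0.2, "politics": 0.7, "tech": 0.5}, hard categorization keeps only "politics", while ranking keeps the full ordering so that a human or a downstream rule can make the final decision.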

4.6. Machine Learning vs. Knowledge Engineering in TC

Both machine learning and knowledge engineering play significant roles in the development of text categorization systems.

4.6.1. Machine Learning

Machine learning algorithms learn automatically from data without being explicitly programmed, enabling systems to adapt and to make predictions or decisions based on patterns in the data [39]. These algorithms include supervised, unsupervised, and reinforcement learning methods, which improve categorization accuracy by finding and analyzing complex patterns, trends, and relationships in large datasets. This is particularly beneficial when working with large, complex data, where it enables a more granular and efficient categorization process.

Machine learning methods have been extensively employed in text categorization for predictive modeling. Essentially, historical or prelabeled data are used to train algorithms, which are later applied to new, unseen data to categorize them effectively [40]. These techniques allow a TC system to extend beyond manual or rule-based approaches, making it scalable and adaptable to large volumes of text [39]. In addition, iterative machine learning makes systems more adaptable by allowing models to learn from feedback loops and updated datasets. Each iteration can improve the model’s performance, allowing it to adapt to new patterns, correct errors, and deal with dynamic and changing data more efficiently. This process enables continual improvement and relevance across a wide range of applications.

4.6.2. Knowledge Engineering

Knowledge engineering focuses on emulating the decision-making processes of human experts in specific domains [40]. It involves capturing and representing expert knowledge, such as rules and reasoning, in expert systems that use those rules together with data to analyze problems and support complex problem solving. These systems are widely used in specialized domains to enhance problem-solving and decision-making capabilities. In the context of TC, knowledge engineering systems integrate human expertise with machine learning outputs to enhance decision-making capabilities and ensure accuracy in categorizations [41].

5. Applications of Text Categorization

This section delves into the numerous uses of text categorization (TC) across disciplines, emphasizing its importance in improving information management, operational efficiency, and customer engagement. It highlights the practical impact of TC techniques by focusing on specific use cases such as document indexing, content customization, and hierarchical web content classification. The detailed subsections explain how TC helps organizations explore and extract value from vast amounts of text data. Text categorization is integral to many sectors, bringing with it a host of benefits that include better information management, enhanced customer engagement, and operational efficiency [42].

5.1. Document Indexing for Information Retrieval Systems

Document indexing involves tagging documents with keywords or key phrases to facilitate retrieval in Boolean IR systems. To avoid inconsistencies in the tags assigned to documents, controlled dictionaries or thematic thesauri, such as the MeSH thesaurus for medicine, are used [43]. Automated indexing has largely replaced manual indexing and helps to manage large databases efficiently in research and library systems.

Role of Controlled Vocabulary and Thesauri

Controlled vocabulary helps standardize the terminology in certain fields and thus supports the consistent categorization of documents. The thesauri give a hierarchical and relational context to the terms, thereby making the search and retrieval processes in systems using TC more effective, particularly in large-scale document databases [44].

5.2. Automated Document Organization and Archiving

For large document bases, TC automates the filing system for corporate records, patent filings, and other institutional archives. The tools can classify patents or group news stories by theme, which can cut down on manual classification workload [45].

Use in Corporate and News Media

Corporations use TC to filter incoming information, such as routing relevant documents to specific departments. News agencies use TC to precategorize articles before publication, for example, placing content under “Politics” or “Lifestyle” [46]. In high-volume environments, this is particularly important for streamlining operations and maintaining organized archives.

5.3. Text Filtering and Content Personalization

Content personalization in TC tailors content to user preferences by classifying information according to user profiles. Systems trained to filter or promote content based on precise thematic categories power applications such as personalized news feeds, customized email filtering, and targeted advertisements [47]. This ensures that users receive relevant and tailored experiences.

Newsfeeds, Email Filtering, and Spam Detection

A good example of text categorization is spam detection, which classifies emails as spam or non-spam based on keywords and sender patterns. Filters analyze text content to block unsolicited emails, relieving users of irrelevant content [48].

5.4. Word Sense Disambiguation (WSD)

WSD resolves the ambiguity of polysemous words by identifying the intended sense of a term in context. It is regarded as one of the fundamental tasks in natural language processing, supporting applications such as search engines and machine translation. Categorizing word senses through WSD enables more accurate keyword searching and indexing [49].

5.5. Hierarchical Categorization of Web Content

Content categorization taxonomy organizes online information into a hierarchical structure, similar to those used in digital libraries or internet directories. Text categorization (TC) techniques classify websites into nested levels, such as “Technology” > “Artificial Intelligence” > “Machine Learning”, enabling users to navigate vast online repositories with ease and efficiency [50].
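One simple way to picture hierarchical categorization is a top-down walk through the taxonomy, choosing the best-matching child at each level. The sketch below is only an illustration under strong assumptions: the taxonomy, its keyword sets, and the keyword-overlap rule are all hypothetical stand-ins for a trained classifier at each node.

```python
# Hypothetical taxonomy: each node maps a child name to (keywords, subtree).
TAXONOMY = {
    "Technology": ({"software", "computer", "model"}, {
        "Artificial Intelligence": ({"model", "learning", "neural"}, {
            "Machine Learning": ({"learning", "training", "classifier"}, {}),
        }),
    }),
    "Sports": ({"match", "team", "score"}, {}),
}

def classify_path(tokens, taxonomy=TAXONOMY):
    """Descend the taxonomy, following the child with the largest keyword overlap."""
    path, level, words = [], taxonomy, set(tokens)
    while level:
        # Pick the child whose keyword set overlaps most with the document.
        name, (keywords, subtree) = max(
            level.items(), key=lambda kv: len(kv[1][0] & words)
        )
        if not keywords & words:
            break  # no child matches, so stop at the current depth
        path.append(name)
        level = subtree
    return path
```

The returned list is the path from the root to the deepest matching node, mirroring the nested levels shown in the “Technology” example above.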

6. Machine Learning Techniques in Text Categorization (TC)

Machine learning (ML) has proven essential to the creation of and progress in automated text categorization (TC) systems. Text classification (TC) is the process of categorizing text documents based on their content. ML techniques improve TC by automatically learning from big datasets and refining categorization models, resulting in increased efficiency, accuracy, and flexibility for varied data sources [51].

This section provides a comprehensive overview of machine learning techniques for text classification, focusing on several key areas. It focuses on supervised learning approaches because they are widely used and proven effective in text categorization problems. While unsupervised approaches have advantages, they are outside the focus of this paper because they are often used for exploratory or clustering tasks rather than preset categorization. The section also delves into classifier construction, offering insights into various types of algorithms and their design for effective text categorization. Additionally, it highlights the importance of feature selection and engineering, discussing techniques to identify and refine features that enhance model performance. Lastly, it examines advanced approaches to text categorization, showcasing cutting-edge machine learning methods that improve accuracy and adaptability in classification tasks.

6.1. Supervised Learning Techniques

Supervised learning is the most common strategy in TC, in which labeled data are used to “teach” models how to effectively categorize texts. This method divides datasets into three subsets: training, test, and validation sets. The training set is utilized to build the model; the validation set fine-tunes hyperparameters and evaluates model performance during development; and the test set is reserved for final evaluation to ensure generalizability [23].

Keeping these subsets separate helps detect and reduce overfitting, which occurs when a model performs well on training data but poorly on unseen data [52]. Furthermore, effective partitioning ensures balanced and representative data, which is critical for applications such as sentiment analysis and document categorization, where certain terms and contexts must be learned consistently [43].
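The three-way partition described above can be sketched as follows. The 80/10/10 proportions, the fixed seed, and the function name are assumptions for illustration; the text does not prescribe specific ratios, and this simple shuffle does not stratify by class.

```python
import random

def split_dataset(examples, train=0.8, validation=0.1, seed=0):
    """Shuffle and partition examples into training, validation, and test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * train)
    n_val = int(len(items) * validation)
    return (
        items[:n_train],                 # used to fit the model
        items[n_train:n_train + n_val],  # used to tune hyperparameters
        items[n_train + n_val:],         # held out for the final evaluation
    )
```

For imbalanced corpora, a stratified split that preserves class proportions in each subset would usually be a better fit than this plain shuffle.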

6.2. Classifier Construction and Types of Algorithms

The choice of algorithm greatly influences the building of classifiers for TC, as it dictates the model’s learning strategy, interpretability, and processing needs [53]. Rule-based systems, decision trees, naive Bayes, and neural networks are among the most used TC algorithms, each with its strengths and shortcomings when dealing with text input.

(a). Rule-based Systems: These classifiers use handcrafted rules, which are highly interpretable but less flexible for complicated or very large datasets [43]. Rule-based systems nevertheless remain useful in situations where plain, transparent decision making is required.
(b). Decision Trees: Decision trees divide data based on certain criteria, making them intuitive and interpretable but susceptible to overfitting. Decision trees are effective for small to medium-sized text corpora, but they may struggle with scalability and feature depth [51].
(c). Naive Bayes: Naive Bayes is frequently used in TC due to its simplicity, efficiency, and resilience, especially in document categorization and spam filtering [54]. However, while the assumption of feature independence simplifies calculation, it can reduce effectiveness when features are highly correlated [53]. Figure 3 shows how naive Bayes works.

Nodes:
- The topmost node “C” represents the class label of the text document.
- The nodes labeled F₁, F₂, …, F_n represent features (words, phrases, or attributes) extracted from the text.
Arrows:
- The black arrows from C → {F₁, F₂, …, F_n} indicate that the classification decision directly influences the features. This suggests a Naïve Bayes model assumption, where features are conditionally independent given the class.
- The blue arrows between features represent feature dependencies or correlations (e.g., word co-occurrence relationships). This indicates that some features depend on each other, making the model more complex than Naïve Bayes.
Figure 3. Naive-Bayes-Based Classification [55].

[Figure omitted. See PDF]

(d). Neural Networks: “Neural networks, particularly deep learning models, have transformed TC by allowing them to learn sophisticated, hierarchical text representations. Although neural networks often need big datasets and significant computer resources, they provide unrivaled accuracy in capturing semantic meaning and contextual nuances” [56].

Each of these techniques is used depending on the use case, dataset features, and resource availability, emphasizing the importance of personalized ML approaches in TC.
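To make the naive Bayes entry above concrete, here is a minimal multinomial naive Bayes sketch with add-one (Laplace) smoothing. It is an illustrative implementation, not the method of any cited study; the class name and the choice to ignore words unseen in training are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes over tokenized documents."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Add-one smoothing: every vocabulary word gets one extra count.
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][w] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Summing per-word log-probabilities is exactly where the conditional independence assumption enters: each word contributes to the score separately, given the class.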

6.3. Feature Selection and Engineering

Feature selection and engineering are critical in TC because they determine the data properties the model learns from, which influences its overall performance [57]. Text classification features are often words, sentences, or semantic representations, so their selection is critical for increasing model accuracy and interpretability [52]. By focusing solely on important features, effective feature selection eliminates unnecessary data and computational expense. Term frequency–inverse document frequency (TF-IDF) and word embeddings are popular techniques for capturing the textual structure, context, and importance of words in texts. Feature engineering approaches like stemming, lemmatization, and n-gram analysis further improve feature representation, enhancing classifier performance [43]. In addition, high-quality feature selection frequently results in improved generalization across domains and datasets, which is crucial for applications that require models to work in various languages or specialized disciplines [51].

6.4. Advanced Machine Learning Approaches to Text Categorization

The evolution of ML has seen a blend of traditional and advanced models enhancing TC accuracy and efficiency.

6.4.1. Traditional ML Techniques

Naive Bayes and Logistic Regression: Offer simplicity and effectiveness in text classification, with naive Bayes achieving up to 96.86% accuracy in specific datasets [23].
Support Vector Machines (SVMs): Efficiently handle high-dimensional data and demonstrate strong performance with word embeddings.
Random Forest (RF): Reported to achieve a mean accuracy of 99.98% when combined with Word2Vec embeddings [58].
K-Nearest Neighbors (KNN) and Decision Trees: Useful for smaller datasets but less effective compared to SVM and RF [59].

6.4.2. Deep Learning Approaches

Convolutional Neural Networks (CNNs): Capture local patterns in text, making them well suited to classification tasks.
Recurrent Neural Networks (RNNs): RNNs, such as long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, are especially useful for modeling sequential dependencies in text. They excel at tasks that need contextual comprehension, such as sentiment analysis and time-series forecasting.
Transformer-based Models: Transformer-based models, like BERT, have transformed text classification by exploiting self-attention mechanisms to detect global dependencies in text. Their ability to construct contextual embeddings has established new standards for several natural language processing tasks, achieving over 97% accuracy in some applications [58].

Figure 4 illustrates a convolutional neural network (CNN) architecture designed for classifying handwritten digits, such as those found in the MNIST dataset. For grayscale images, the CNN input is a multidimensional array of 28 × 28 × 1 pixels. The first convolutional layer generates feature maps of 24 × 24 × n1, where n1 is the number of filters used. Subsequent layers reduce the spatial dimensions while increasing depth, depending on the number of filters. The same principle of collecting hierarchical features is what makes CNNs useful for text categorization.
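The step from 28 × 28 to 24 × 24 follows from standard convolution arithmetic. The sketch below assumes a 5 × 5 filter with stride 1 and no padding, which is consistent with the stated sizes but is not given in the text; the function name is illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Assumed 5x5 filter, stride 1, no padding: 28 -> 24
first_layer = conv_output_size(28, 5)
# A hypothetical 2x2 pooling step with stride 2 would then give 24 -> 12
pooled = conv_output_size(first_layer, 2, stride=2)
print(first_layer, pooled)
```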

6.4.3. Hybrid and Ensemble Methods

Model Combinations: Traditional classifiers paired with similarity measures like cosine similarity enhance performance [24].
Ensemble Learning: Combines diverse models to boost robustness and accuracy in TC tasks [61].

7. Document Representation Techniques

Document representation techniques translate text documents into structured formats that machine learning models can process while retaining as much semantic information as feasible. Effective representation strategies support accurate text classification, enhancing models’ ability to recognize and analyze key patterns in textual data [62].

This section explores document processing techniques, providing a detailed examination of foundational and advanced methods. Topics include the vector space model (VSM) and its applications, the evolution from bag-of-words to more sophisticated approaches, and techniques in lexical semantics and text tokenization for understanding textual content. It also highlights word stemming and stop word removal as essential preprocessing steps, along with a discussion on weighting schemes such as term frequency–inverse document frequency (TF-IDF) and other innovative weighting strategies to enhance text analysis and classification.

7.1. Vector Space Model (VSM)

The vector space model (VSM) serves as a fundamental tool for document representation in text categorization. It represents documents as vectors within a multidimensional space, where each dimension corresponds to a distinct term from the corpus [63]. VSM enables the calculation of document similarity using metrics such as cosine similarity, which is useful in applications like clustering, search, and categorization. Figure 5 illustrates how the Vector Space Model operates.
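As a minimal sketch of how similarity is computed in the VSM, the example below represents three documents as term-count vectors and compares them with cosine similarity. The four-term vocabulary and the counts are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_u * norm_v)

# Dimensions correspond to the terms ["ball", "goal", "stock", "market"]
doc_sports = [2, 1, 0, 0]
doc_match = [1, 1, 0, 0]
doc_finance = [0, 0, 3, 1]

print(cosine_similarity(doc_sports, doc_match))    # close to 1: shared terms
print(cosine_similarity(doc_sports, doc_finance))  # 0: no shared terms
```

Because cosine similarity depends only on direction, a long and a short document about the same topic can still score as highly similar.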

7.2. Bag-of-Words and Beyond

The bag-of-words (BoW) approach in VSM is a simple but effective strategy that represents each text as a collection of individual terms, disregarding word order but capturing the frequency of each term. BoW is computationally efficient and successful for many classification applications, but it has drawbacks, such as neglecting word order and semantic nuances [43]. To address these limitations, BoW extensions such as n-grams and distributed representations have emerged, which better capture word context and relationships by taking term sequences into account or employing embeddings [65]. These methods increase the semantic depth of document representation, making them appropriate for more complicated text analysis tasks. Figure 6 shows how the Bag-of-Words model works.
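The loss of word order in BoW, and how n-grams recover part of it, can be illustrated with a short sketch. The two example sentences and the helper names are illustrative only.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each term to its frequency, discarding order."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the dog bit the man".split()
b = "the man bit the dog".split()

# Identical bags of words, although the sentences mean different things
print(bag_of_words(a) == bag_of_words(b))
# Bigram counts differ, so some word-order information is retained
print(Counter(ngrams(a, 2)) == Counter(ngrams(b, 2)))
```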

7.3. Lexical Semantics and Text Tokenization

Lexical semantics, when paired with tokenization, divides text into meaningful units while preserving the document’s fundamental information. Tokenization breaks down text into smaller components, typically words or phrases, allowing algorithms to handle text as discrete tokens rather than continuous strings [62].

7.4. Word Stemming and Stop Word Removal

Many TC applications rely heavily on stemming and stop word removal to improve document representation. Stemming reduces words to their base forms, grouping variations of the same term to prevent repetition in representations. For example, “running”, “ran”, and “runner” are all derived from “run”. This simplification enables models to concentrate on key meanings, increasing efficiency and relevance in text analysis [67].

Stop word removal entails removing common terms like “the”, “is”, and “and”, which often add little to document classification. By omitting these terms, models reduce computational complexity while increasing accuracy by focusing on more informative words. These preprocessing techniques are especially beneficial in fields where separating important terms from common ones is critical to accurate categorization [43].

Preprocessing techniques like stemming, lemmatization, and stop word removal are essential for accurate text categorization because they remove noise and simplify textual data. These strategies ensure that models concentrate on relevant patterns by standardizing word forms (e.g., “run”, “running”, and “ran” into “run”) and removing non-informative terms (e.g., “the”, “is”). This simplified form improves both computing efficiency and feature relevance for classification.
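The preprocessing steps described in this subsection can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the stop word list is deliberately tiny, and `toy_stem` is a crude suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop word list; real lists are much longer
STOP_WORDS = {"the", "is", "and", "a", "of", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(token):
    """Strip a few common suffixes (hypothetical, simplified rules)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The dog is running and jumped"))
```

Suffix stripping of this kind leaves irregular forms such as “ran” unchanged; mapping them to “run” requires dictionary-based lemmatization.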

7.5. Weighting Schemes and Alternatives

Weighting schemes assign importance to terms in a document, which helps models identify the most relevant features for categorization. While TF-IDF is the most extensively used approach, other schemes, such as entropy weighting and BM25, have advantages in some settings, such as controlling term importance across many datasets. Accurate weighting distinguishes terms that carry significant information from those that do not, which improves classification results [68].

Term Frequency–Inverse Document Frequency (TF-IDF)

One of the most widely used weighting methods in text categorization (TC) is term frequency–inverse document frequency (TF-IDF). This metric evaluates a word’s significance within a text by considering its frequency within the document and its distribution across the entire corpus. The term frequency (TF) component highlights terms that occur frequently in a single document, while the inverse document frequency (IDF) component downscales the weight of terms that are common across multiple documents. This approach ensures a more balanced representation of term importance [69]. TF-IDF has proven effective in various applications, such as document retrieval and categorization, by prioritizing unique and contextually significant terms [62].
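The interaction of the TF and IDF components can be seen in a short sketch. It assumes one common variant, with term frequency normalized by document length and IDF computed as log(N / df); other variants add smoothing or use raw counts. The corpus is hypothetical and every document is assumed to be non-empty.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in corpus:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = tf_idf(corpus)
print(weights[0])
```

In this example “the” occurs in every document and receives a weight of zero, while “sat”, which occurs in only one document, outweighs “cat”, which occurs in two.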

In addition to TF-IDF, various weighting techniques such as entropy weighting and BM25 have been investigated to better capture word significance across different contexts [70]. Entropy weighting, for example, assesses each term’s informational contribution across categories, minimizing the impact of highly predictable phrases [71]. The BM25 technique, an extension of TF-IDF, provides an improved strategy for document retrieval tasks by integrating parameters that account for document length and frequency saturation, hence improving performance in big text corpora [72].
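For reference, the commonly cited Okapi form of the BM25 score of a document $D$ for a query $Q$ is shown below; the cited work may use a variant. The symbols are introduced here for illustration: $f(q, D)$ is the frequency of term $q$ in $D$, $|D|$ is the document length, $\text{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are free parameters.

$\text{score}(D, Q) = \sum_{q \in Q} \text{IDF}(q) \cdot \frac{f(q, D) \, (k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\text{avgdl}}\right)}$

The $k_1$ term produces the frequency saturation mentioned above, since the fraction approaches $k_1 + 1$ as $f(q, D)$ grows, and $b$ controls how strongly long documents are penalized.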

These weighting techniques address a wide range of text processing needs while also improving document representation flexibility, making them useful for TC applications that must deal with heterogeneous datasets and complex language patterns.

8. Dimension Reduction in Text Categorization

Dimensionality reduction (DR) is a vital process in text categorization, aimed at addressing the challenge of high-dimensional feature spaces that often characterize text datasets. These datasets can consist of thousands or even millions of unique words, making the feature space complex and computationally intensive. DR techniques help by eliminating noisy or irrelevant terms, thereby enhancing training efficiency and model interpretability without compromising critical information [73]. Additionally, DR mitigates overfitting, a common issue where models are excessively tailored to the training data, hindering their ability to generalize to new, unseen data [1].

Methods such as principal component analysis (PCA) and latent semantic analysis (LSA) are commonly employed to lower dimensionality while retaining the structural and relational integrity of the text data. This streamlined representation facilitates faster processing and more accurate predictions, making dimensionality reduction an indispensable component of the machine learning pipeline.

8.1. Importance of Dimensionality Reduction

Reducing dimensionality offers several key advantages:

Improved Efficiency: Streamlines computational demands, particularly during training and testing phases.
Enhanced Interpretability: Simplifies understanding by focusing on the most significant features.
Reduced Overfitting: Ensures the model learns generalizable patterns rather than noise specific to the training dataset.

These benefits collectively enable the creation of robust and reliable machine learning models, empowering practitioners to derive meaningful insights from complex datasets. Techniques such as PCA and t-distributed stochastic neighbor embedding (t-SNE) have proven effective in maintaining essential information while reducing dimensions, thereby improving model performance and the extraction of insights.

Dimensionality reduction techniques focus on identifying discriminative features, which are then weighted and fed into classifiers to construct models. During the testing phase, test documents are preprocessed and represented using the same methods applied during training. This ensures consistency in data handling, leading to more reliable predictions and deeper insights into underlying patterns within the dataset.

8.2. Dimensionality Reduction in Support Vector Machines (SVMs)

Support vector machines (SVMs) benefit significantly from dimensionality reduction, particularly when handling high-dimensional data. The optimization process in SVMs relies on the dual formulation of soft margin SVMs, which transforms the primal optimization problem into a dual problem. This approach leverages kernel functions to efficiently handle non-linear classifications.

Kernel Functions in SVM

1. Linear Kernel

The linear kernel is represented as:

$K(x, x_i) = (x \cdot x_i)$

This calculates the dot product of two feature vectors, x and x_i. This straightforward kernel works well for linearly separable data, where a linear decision boundary can effectively separate the classes.

2. Polynomial Kernel

The polynomial kernel is expressed as:

$K(x, x_i) = [(x \cdot x_i) + \beta]^d$

where d is the degree of the polynomial and β is a constant. This kernel enables the SVM to capture more complex relationships between data points by considering polynomial interactions of features. The degree d determines the level of complexity in the model, with higher degrees capturing more intricate patterns.

3. Gaussian RBF Kernel

The Gaussian RBF kernel is given by:

$K(x, x_i) = \exp(-\gamma \|x - x_i\|^2)$

where γ is a parameter that controls the kernel’s flexibility and sensitivity to differences between data points. It maps the data into an infinite-dimensional space, allowing the SVM to create highly non-linear decision boundaries. The parameter γ determines how closely the model fits the data, with larger values resulting in tighter fits around individual data points and smaller values producing smoother decision boundaries [74].
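The three kernels can be written directly from their definitions. The following is a minimal sketch on plain Python lists; the default values for β, d, and γ are arbitrary choices for illustration, not recommendations.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, beta=1.0, d=2):
    return (dot(x, y) + beta) ** d

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [1.0, 2.0]
x_i = [3.0, 0.0]

print(linear_kernel(x, x_i))      # dot product: 3.0
print(polynomial_kernel(x, x_i))  # (3 + 1)^2 = 16.0
print(rbf_kernel(x, x_i))         # exp(-0.5 * 8), roughly 0.018
```

The RBF kernel equals 1 for identical points and decays toward 0 with distance; a larger γ makes that decay faster, which corresponds to the tighter fits described above.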

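While kernel functions mitigate the impact of high feature space dimensions on computational complexity, the dimensionality of the input space still influences kernel evaluations, especially for large datasets. The optimal hyperplane is given by the following decision function:

$F(x, \alpha^*, b) = \sum_{i \in SV} y_i \alpha_i^* K(x, x_i) + b$

where the sum runs over the support vectors $x_i$ with class labels $y_i$, $\alpha_i^*$ are the optimal dual coefficients, and $b$ is the bias term.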

This hyperplane is calculated using support vectors, kernel functions, and bias terms, enabling precise classification. Dimensionality reduction enhances SVM efficiency by reducing the computational overhead required for training and testing [74].

8.3. Text Representation in Dimensionality Reduction

In text categorization, documents are typically represented as a term–document matrix $A = (a_{ij})$, where:

Rows: Represent terms.
Columns: Represent documents.
Entries ($a_{ij}$): Indicate the frequency or presence of term $i$ in document $j$.

This matrix serves as the basis for clustering and classification tasks, with dimensionality reduction techniques applied to enhance efficiency and accuracy. By leveraging this representation, models can focus on essential features, enabling better performance in high-dimensional spaces [24].
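A term–document matrix of this form can be built in a few lines. The sketch below uses raw term counts as entries and sorts the terms alphabetically to fix the row order; both choices, along with the toy documents, are assumptions made for illustration.

```python
def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, entries are counts."""
    terms = sorted({t for doc in docs for t in doc})
    matrix = [[doc.count(term) for doc in docs] for term in terms]
    return terms, matrix

docs = [["cat", "sat", "cat"], ["dog", "sat"]]
terms, A = term_document_matrix(docs)

for term, row in zip(terms, A):
    print(term, row)
```

Real corpora produce very sparse matrices of this kind, which is why sparse storage formats and dimensionality reduction are used in practice.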

8.4. Common Methods for Term Selection

8.4.1. Document Frequency

This approach selects terms based on the number of documents in which they appear, on the premise that frequently occurring terms may carry more importance for classification. However, terms that are common across all documents (like stop words) are typically excluded. The calculation determines the number of documents within a collection that contain a specific feature, which may be a word, phrase, n-gram, or custom-derived attribute. Counting is binary: each document in which a feature is present increments its document frequency (DF) by one, regardless of how many times the feature occurs in that document. As a result, the conventional DF metric captures only the presence or absence of a feature in a text and does not account for its frequency, significance, or relevance within the document itself [75], which leads to an incomplete representation of feature importance in tasks like text classification.

To address these limitations, the term frequency–inverse document frequency (TF-IDF) metric is frequently employed, as it evaluates both the occurrence of a feature within a document and its distribution across the dataset. This results in a more accurate assessment of a feature’s significance. TF-IDF is particularly useful for minimizing irrelevant terms in tasks like text summarization and classification [76]. As a commonly used feature weighting method in the vector space model, TF-IDF is widely applied in text mining and information retrieval. It effectively emphasizes the importance of a term within a document collection, treating all documents equally in its computation [77].
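Binary document-frequency counting and threshold-based term selection, as described in this subsection, can be sketched as follows. The thresholds `min_df` and `max_df_ratio` are hypothetical parameters chosen for this example, as is the toy corpus.

```python
from collections import Counter

def document_frequency(corpus):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # a document counts a term at most once
    return df

def select_by_df(corpus, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor present in nearly every document."""
    df = document_frequency(corpus)
    n_docs = len(corpus)
    return sorted(t for t, c in df.items()
                  if c >= min_df and c / n_docs <= max_df_ratio)

corpus = [
    ["the", "goal", "goal", "match"],
    ["the", "match", "referee"],
    ["the", "stock", "market"],
    ["the", "market", "fell"],
]
df = document_frequency(corpus)
print(df["goal"])            # 1, although "goal" occurs twice in one document
print(select_by_df(corpus))  # "the" is dropped as too common
```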

8.4.2. Chi-Square Test

This method evaluates the independence of a term from the document class, selecting terms that show a significant relationship with the target labels. The chi-square test, initially introduced by Pearson, has become a widely used statistical tool for assessing relationships between categorical variables, such as tests of independence and tests of homogeneity [78,79]. By comparing observed and expected frequencies, the test helps uncover patterns or associations that might not be immediately apparent in raw data. Its versatility extends beyond traditional statistical analysis: Tao and Chang use the chi-square test to cluster web query schema [80]. For example, if a search engine wants to categorize search queries into topics like sports, technology, or health, the chi-square test can assess the relationship between query words and the topics, helping to improve search accuracy. Experiments reported in [81] show that using chi-square (χ²) for feature selection improves the performance of text categorization techniques, with an F-measure of 92.20%.
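For a single term and a single category, the chi-square statistic can be computed from a 2 × 2 contingency table of document counts. The sketch below uses the usual closed form for such a table; the argument names `a`, `b`, `c`, `d` and the example counts are assumptions introduced here.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: a row or column is empty
    return n * (a * d - c * b) ** 2 / denom

# Hypothetical counts: a term concentrated in one category
print(chi_square(40, 5, 10, 45))
# A term spread evenly across categories is independent of the class
print(chi_square(25, 25, 25, 25))
```

Terms are then ranked by this score, and the highest-scoring ones are retained as features.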

8.4.3. Mutual Information

This approach evaluates how much information a term contributes to predicting the class label, prioritizing terms that offer the greatest value for classification. Terms are ranked based on their predictive significance, which can be assessed using techniques like document frequency, information gain, mutual information, or the χ² test [81]. The core idea is that the most effective terms are those that exhibit the greatest variation in distribution between positive and negative examples across different categories; these techniques evaluate a term’s ability to distinguish between categories effectively [82]. Document frequency evaluates how frequently a term appears, information gain measures its significance in predicting a category, mutual information quantifies the relationship between a term and a category, and the chi-square test determines whether the two are independent. These methods aid in identifying the most important terms, enhancing support vector machine (SVM) training by concentrating on critical features and patterns.
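One way to quantify the relationship between a term and a category is the mutual information between two binary variables: whether the term occurs in a document and whether the document belongs to the category. The sketch below computes it in bits from a 2 × 2 table of document counts. The argument names and example counts are assumptions, and some works instead use a pointwise variant based on a single cell of the table.

```python
import math

def mutual_information(a, b, c, d):
    """Mutual information (in bits) between term presence and class membership.

    a: in class, term present      b: not in class, term present
    c: in class, term absent       d: not in class, term absent
    """
    n = a + b + c + d
    # (joint count, term marginal, class marginal) for each cell
    cells = [
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    ]
    mi = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint == 0:
            continue  # the limit of p * log(p) as p -> 0 is 0
        mi += (joint / n) * math.log2(joint * n / (term_marginal * class_marginal))
    return mi

print(mutual_information(25, 25, 25, 25))  # independent: 0 bits
print(mutual_information(50, 0, 0, 50))    # term determines the class: 1 bit
```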

8.4.4. Term Clustering

This technique groups terms that are semantically similar, reducing redundancy in the feature set. By clustering similar terms together, the model can focus on clusters rather than individual terms, improving efficiency. Clusters of phrases derived from syntactic meta-features and indexed based on document or document group co-occurrence are typically of higher quality compared to indexing methods that rely solely on individual syntactic phrases, single indexing words, or word clusters [83]. Term clustering differs from term selection in that it focuses on grouping terms that are synonymous or nearly synonymous, whereas term selection primarily aims to eliminate non-informative terms [43]. However, the relationships identified within clusters are often incidental rather than the intended systematic connections originally sought [83]. Optimization techniques have a wide range of applications, including clustering and categorizing text documents, engineering, image processing, speech recognition, pattern recognition, weather forecasting, route optimization, wireless sensor networks, and job scheduling, among others [84]. Grouping terms by analyzing syntactic relationships and co-occurrence patterns improves document indexing by capturing contextual meaning while minimizing redundancy. This approach reflects how words work together, rather than treating them as isolated terms. For example, in legal document retrieval, clustering terms like “contract terms” or “legal agreement” enhances search relevance and accuracy.

8.4.5. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a linear dimensionality reduction method that projects data onto the most significant axes, known as principal components. It is a statistical technique designed to reduce dimensionality while minimizing the loss of variance from the original dataset. PCA identifies the directions of maximum variance within the term–document matrix, allowing for a reduction in the number of features while preserving the majority of the data’s variance. This approach is especially valuable for managing sparse or high-dimensional datasets. It works by computing the covariance matrix of the data and finding its eigenvectors (the principal components) that capture the greatest variance, thereby transforming the initial correlated quantitative variables into new, uncorrelated variables [85]. This simplifies analysis and eliminates redundancy. PCA is widely used for visualization, improving machine learning performance, and handling high-dimensional datasets.
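The covariance-and-eigenvector procedure can be sketched without external libraries by using power iteration to find the first principal component. This is an illustrative simplification: practical implementations typically rely on a singular value decomposition, and the function name, iteration count, and toy data are assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the unit vector along which the centered data varies most."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the features
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim  # starting vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data
        v = [x / norm for x in w]
    return v

# Two strongly correlated features: most variance lies near the diagonal
data = [[2.0, 1.9], [4.0, 4.1], [6.0, 5.9], [8.0, 8.1]]
pc = first_principal_component(data)
print(pc)
```

Projecting each centered row onto this vector replaces the two correlated features with a single coordinate that retains nearly all of the variance.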

8.5. Comparison of Dimensionality Reduction Methods

Term Extraction Techniques

Document frequency (DF) measures the number of documents in a collection in which a term appears. Terms that appear in very many or very few documents may not provide useful distinguishing information. This technique is widely used for filtering out common terms (for example, stop words) or rare terms. Weighting schemes such as term frequency–inverse document frequency (TF-IDF) are then used to convert a document into a structured format [86].

The chi-square test measures the dependence between two categorical variables, such as the occurrence of a term and its corresponding document category. It identifies terms that are strongly associated with specific categories, aiding in feature selection. A higher χ² value signifies a stronger relationship between the term and the category. This test is computationally efficient and is commonly used to examine the independence of categorical variables or assess how well a sample aligns with the distribution of a known population (goodness of fit) [87].

Mutual information (MI) quantifies the dependency between two variables (terms). In text analysis, it measures how much information one term provides about another, capturing both frequency and context. High mutual information values suggest that the term is informative and relevant to the target classification task. Estimating MI accurately is a complex task, and using it as an objective in representation learning often leads to highly entangled representations because of its invariance under arbitrary invertible transformations. Despite these difficulties, MI-based methods have repeatedly proven to be highly effective in practical scenarios [88].

9. Evaluation of Text Categorization Models

Text categorization is a key task in natural language processing (NLP). The aim of text categorization methods is to associate one (or more) of a given set of categories to a particular document [89]. Evaluating the performance of text categorization models, which involves the use of various metrics, is crucial for understanding their effectiveness and ensuring they perform well in real-world applications. This section discusses the evaluation of text categorization models, focusing on performance metrics, the F-measure, and challenges associated with model evaluation.

9.1. Metrics for Performance Evaluation

Performance metrics are used to assess the model’s effectiveness in accurately classifying text into the appropriate categories. Various evaluation measures are commonly employed, such as recall, precision, accuracy, error rate, F-measure or break-even point, micro-average and macro-average for binary classification, and 11-point average precision for ranking categories [90].

9.1.1. Accuracy

Accuracy is the ratio of correctly predicted instances to the total instances in the dataset. While simple and widely used, it can be misleading in imbalanced datasets.

$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}}$

In text categorization (such as classifying documents into multiple categories or topics), model performance is often evaluated using metrics like accuracy (how often the model predicts the correct category) or error rate (how often the model is wrong). However, Yang [90] points out key issues when applying these metrics to datasets in which the number of categories is large and each document is associated with only a small number of categories on average. On such a dataset, a simplistic algorithm that rejects all documents for every category would achieve a global average error rate of 1.3% and a global average accuracy of 98.7%, whether measured on a micro- or macro-scale, as these values would be identical [91]. This does not imply that a trivial rejector classifier is effective; rather, it highlights that accuracy or error alone may not be reliable metrics for evaluating the performance or utility of a classifier in text categorization [90]. Here, a trivial classifier is a model that generates basic, non-informative predictions: it consistently predicts that no categories are assigned to any document, earning it the label of a “rejector classifier”. By contrast, a predictor–rejector formulation involves learning both a predictor and a rejector, each derived from distinct families of functions, while explicitly considering the cost of abstaining from making a prediction [93]. Selecting an appropriate performance evaluation metric becomes especially critical when dealing with advanced machine learning methods, such as neural networks, to ensure meaningful and accurate assessments of their predictive capabilities [92].
Despite failing to perform any meaningful classification, the rejector classifier could achieve a global accuracy of 98.7%, primarily because most documents in the dataset have very few assigned categories, so the vast majority of document–category pairs are negative. While accuracy can be a reliable metric when positive and negative examples are balanced, it becomes misleading in imbalanced scenarios. For instance, if negative examples significantly outnumber positive ones, a system that assigns no documents to any category can still achieve an accuracy value close to 1, even though it provides no useful information for classification, as it fails to differentiate between relevant and irrelevant categories [93].
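The behavior of a trivial rejector can be reproduced on synthetic data. In the sketch below, the 13 positive and 987 negative document–category decisions are invented solely to match the 98.7% figure quoted above; they are not taken from the cited dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of decisions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives that were found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Synthetic, highly imbalanced decisions: 13 positives, 987 negatives
y_true = [1] * 13 + [0] * 987
# The trivial rejector assigns no document to any category
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # high accuracy
print(recall(y_true, y_pred))    # yet no positive is ever found
```

The gap between the two numbers is the reason precision, recall, and F-measure are preferred over accuracy for this kind of data.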

9.1.2. Precision

Precision (also called positive predictive value) measures the accuracy of positive predictions. It is the ratio of true positives to the total predicted positives [94].

$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \times 100$

Precision plays a critical role in scenarios where the cost of false positives is significant, such as spam detection, where misclassifying a legitimate email as spam can lead to undesirable outcomes. In the field of information retrieval, precision refers to the percentage of retrieved documents that are relevant, while recall represents the percentage of relevant documents successfully retrieved from the total set of relevant documents [95]. Studies have reported impressive results, with recall and precision averaging around 90% on a small subset (3%) of a specific corpus [43]. It is noted that micro-averaged scores (recall, precision, and F1) are predominantly influenced by the classifier’s performance on frequently occurring categories, whereas macro-averaged scores are more impacted by performance on less common categories [90].

9.1.3. Recall (Sensitivity)

Recall (also called sensitivity or true positive rate) shows how well the model identifies all relevant instances. It is the ratio of true positives to the total actual positives [94].

$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \times 100$

Recall is crucial in situations where the cost of false negatives is high, such as in medical diagnostics, where failing to detect a positive case could have serious consequences. This measure evaluates the system’s ability to identify true positives, with average performance sometimes assessed across different recall thresholds for all test documents [90].

F-Measure

The F1-measure serves as the harmonic mean of precision and recall [94], providing a balanced evaluation of both metrics. It is especially valuable in scenarios with an uneven class distribution, where balancing false positives and false negatives is critical, such as in text classification tasks. The F1-measure is calculated as follows:

$F\text{-Measure} = 2 \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)$

The F-measure is commonly employed when it is necessary to minimize both false positives and false negatives. However, designing an appropriate significance test can be challenging, as the method’s performance is often summarized into a single metric, like the break-even point or the optimized F1-score [90]. Additionally, optimizing predictions to maximize the F1-measure is not always feasible by merely ranking labels based on their relevance and selecting the highest-ranked ones [96].
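Precision, recall, and the F1-measure can all be computed from the same confusion counts. The sketch below returns fractions rather than percentages, and the labels and predictions are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 relevant documents, 3 predicted as relevant
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, f1)
```

Here two of the three predictions are correct (precision 2/3) and two of the four relevant documents are found (recall 1/2), giving an F1 of 4/7, which lies closer to the lower of the two values, as expected of a harmonic mean.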

Table 2 summarizes the mean number of assigned (Assign. Mean) keywords and correct (Corr. Mean) keywords per document, as well as the precision (P), recall (R), and F-measure (F) achieved when extracting 312 keywords per document [97].

9.1.4. Break-Even Point (BEP)

The break-even point represents the point where precision and recall are equal, providing insight into the trade-off between these metrics. Typically, BEP values are interpolated because exact matches of precision and recall are rare. Recall assesses the capacity to retrieve all relevant documents, whereas precision assesses the accuracy of the retrieved documents; together, they provide a fair evaluation of the model’s performance. However, the point where precision equals recall is not always meaningful or desirable from the user’s perspective [93]. Moreover, because F1 equals the common value of precision and recall at the break-even point, the BEP score of a system is always equal to or less than the optimal value of F1 of that system [90].
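The relationship between the BEP and the optimal F1 can be checked directly from the F1 formula. As a short worked step, let $P$ and $R$ denote precision and recall as fractions, and let $x$ be their common value at the break-even point (symbols introduced here for illustration):

$F_1 = \frac{2 P R}{P + R} = \frac{2 x^2}{2 x} = x$

When the break-even point is actually attained at some decision threshold rather than interpolated, that threshold is one of the candidates over which F1 can be optimized, so the best achievable F1 is at least $x$.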

9.2. Validation Techniques

Effective validation techniques are critical to evaluate how well a model performs on unseen data. The rapid growth of digital text data has necessitated the development of new methods for text processing and classification [84].

9.2.1. K-Fold Cross-Validation

K-fold cross-validation divides a dataset into “k” subsets. Each subset is used once as a validation set, while the remaining k-1 subsets are used for training. This ensures that each data point is utilized for both training and validation [98]. As “k” increases, the evaluation becomes more stable by averaging the results over more models. However, increasing “k” also requires training more models, making it important to choose an appropriate “k” value [99]. This method is especially useful in fields like healthcare, where it helps assess classification model performance with limited datasets [100].
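The partitioning described above can be sketched as follows. This is a minimal illustration that assumes the data have already been shuffled; the generator name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), "training samples,", "validation fold:", val_idx)
```

A model would be trained once per fold and the k validation scores averaged, which is where the additional cost of a larger k comes from.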

9.2.2. Train–Test Split

A simpler validation approach is the train–test split, which divides the dataset into two parts: a training set for developing the model and a test set for evaluating its performance. A common split ratio is 80% training and 20% testing, although this may vary. The train–test split is often used in meta-learning, where models are adapted to specific tasks using one subset of data and evaluated on another [101]. While the train–test split trains and evaluates a single model, cross-validation trains multiple models on different data subsets and therefore gives a more reliable estimate of generalization [102].
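A minimal sketch of an 80/20 split is given below. The helper name, the fixed seed, and the placeholder documents are assumptions made for illustration.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for testing."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"doc{i}" for i in range(10)]  # placeholder documents
train_docs, test_docs = train_test_split(docs, test_ratio=0.2)
print(len(train_docs), "training documents,", len(test_docs), "test documents")
```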

9.3. Challenges in Model Evaluation

Knowledge-based systems rely on lexical databases, which exist for only a small number of languages, so such systems can be developed only for those languages. They also tend to be specific to particular languages and subjects, which makes them hard to transfer to others, and they are unavailable for some subjects altogether. Because languages keep changing, these systems can be costly to maintain [84]. Researchers are urged to develop these resources for underrepresented languages to expand system usability.
Building and deploying a deep-learning-based system can be highly resource-intensive, as training such systems requires expensive hardware and significant computational power, which must be accounted for [84].
The semantic relationships among the words in a text document complicate text categorization and make such systems hard to build. Obtaining these relationships from unlabeled text data is particularly difficult [84].

9.4. Challenges in Machine-Learning-Based Text Classification

This section examines the challenges in machine-learning-based text classification, addressing critical issues such as overfitting and underfitting, which impact model generalization; class imbalance, which skews classification results; feature space complexity, which complicates model training and interpretation; and linguistic challenges like ambiguity and polysemy, which hinder accurate text understanding and categorization.

9.5. Overfitting and Underfitting in TC

Overfitting and underfitting pose major challenges to the quality of classification models. Overfitting occurs when a model fits the training data too closely, including its noise, leading to excellent performance on training data but poor generalization to unseen data. Both overfitting and underfitting can cause training errors that significantly impact the reliability of deep-learning-based communication systems [103]. Regularization, dropout layers, and data augmentation help to prevent overfitting by constraining model complexity and lowering sensitivity to individual parameters. The ability to perform well on unseen data is called generalization, and improving generalization is the main remedy for overfitting [104]. Underfitting happens when a model is too simple to capture the major patterns in the data, resulting in poor performance on both the training and test sets. In text classification (TC), underfitting can result from basic algorithms or insufficient feature extraction, which limit the model’s capacity to recognize linguistic complexity and thematic nuances. Underfitting can be addressed by increasing model capacity, for example with additional layers, transformer architectures, or pretrained models, and by including a more diverse variety of data points. These changes increase TC models’ generalization power, allowing them to reliably categorize a wider range of text types [105].
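The two failure modes can be told apart by comparing training and validation accuracy. The following heuristic is only a sketch: the function name and both thresholds (`gap_tol`, `low_acc`) are hypothetical values chosen for illustration and would need tuning for any real task.

```python
def diagnose_fit(train_acc, val_acc, gap_tol=0.10, low_acc=0.70):
    """Rough heuristic: low scores everywhere suggest underfitting,
    a large train/validation gap suggests overfitting."""
    if train_acc < low_acc and val_acc < low_acc:
        return "underfitting"
    if train_acc - val_acc > gap_tol:
        return "overfitting"
    return "reasonable fit"


print(diagnose_fit(0.99, 0.72))  # high training score, much lower validation score
print(diagnose_fit(0.60, 0.58))  # poor on both sets
print(diagnose_fit(0.90, 0.87))  # small gap, acceptable scores
```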

9.6. Class Imbalance in TC

The class imbalance problem in text categorization (TC) occurs when certain categories dominate a dataset, while others are underrepresented. This imbalance can cause machine learning models to favor majority classes, resulting in biased predictions. This is especially troublesome in applications such as spam detection or sentiment analysis, where minority classes are important. Addressing class imbalance is critical to ensuring TC models’ robustness and fairness. For example, in one multiclass setting, the rapidly intensifying (RI) and extraordinarily intensifying (EI) classes have significantly fewer training samples than the neutral and weakening classes [106].

The terms rapidly intensifying (RI) and extraordinarily intensifying (EI) refer to system intensity changes, with RI characterizing a rapid and large increase in strength over a short period of time and EI referring to an unusual, rare escalation in intensity beyond regular patterns. Neutral denotes systems with little or no intensity change, whereas weakening denotes systems losing strength owing to unfavorable conditions. These classifications aid in understanding system behavior, allowing for more accurate analysis and prediction.

Class imbalance is frequently the result of natural data distribution. Sports and politics, for example, may have significantly more data than specialty fields such as environmental news, particularly in user-generated content or real-time applications. An imbalance in class distribution skews models toward the majority class, reducing their ability to generalize effectively across different scenarios. To address this, data-level methods balance the dataset either by oversampling minority classes, for example with the synthetic minority oversampling technique (SMOTE), which generates synthetic minority samples, or by undersampling majority classes. However, classifiers trained and evaluated on increasingly imbalanced datasets often exhibit artificially inflated classification accuracy, which can be misleading [106]. Algorithmic approaches modify the learning process by allocating higher weights to minority classes, and ensemble techniques such as boosting and bagging are also useful. Advanced models such as BERT and GPT, through fine-tuning and cost-sensitive learning, aid in minority class recognition in highly skewed datasets [107].
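The sketch below illustrates one data-level and one algorithm-level remedy on a hypothetical 9:1 spam dataset. Note that it uses plain random oversampling (duplicating minority samples), not SMOTE, which instead synthesizes new samples in feature space. The inverse-frequency weighting formula is one common convention, and all names are illustrative.

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for c, cnt in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == c]
        for _ in range(target - cnt):
            out_x.append(rng.choice(pool))
            out_y.append(c)
    return out_x, out_y


labels = ["ham"] * 9 + ["spam"]          # hypothetical imbalanced labels
texts = [f"msg{i}" for i in range(10)]   # placeholder messages
weights = class_weights(labels)
balanced_x, balanced_y = random_oversample(texts, labels)
# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline_acc = max(Counter(labels).values()) / len(labels)
print(weights, Counter(balanced_y), majority_baseline_acc)
```

The majority-class baseline already reaches 90% accuracy here without detecting any spam, which illustrates why accuracy alone can be misleading on imbalanced data.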

9.7. Complexity in Feature Space

Gaining a deeper understanding of the distribution of patterns within the feature space can provide valuable insights into the difficulty and complexity of various classification tasks [108]. The feature space in text categorization (TC) refers to the structured dimensions or variables used to process text input for machine learning. Text is called unstructured since it consists of words, phrases, and syntax that do not follow a predefined format or numerical representation. Unlike structured data, such as tables or spreadsheets, text requires processing techniques such as tokenization and embedding to numerically represent its semantic and grammatical qualities, resulting in a multidimensional feature space. This intricacy can make it difficult for models to train successfully, resulting in significant computing costs and the danger of overfitting. To address these difficulties, feature selection and dimensionality reduction approaches can manage feature space complexity while retaining critical information [39].
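One simple feature selection strategy is document-frequency thresholding, which discards terms that are too rare or too common to be informative. The sketch below is an illustration only; the function name, the thresholds, and the toy corpus are assumptions rather than a method prescribed by the cited works.

```python
from collections import Counter


def select_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that occur in at least min_df documents
    and in at most max_df_ratio of all documents."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    df = Counter(term for tokens in tokenized for term in tokens)
    limit = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= limit)


docs = [
    "the bank raises funds",
    "the river bank floods",
    "the team wins the match",
    "the team raises funds",
]
vocab = select_by_document_frequency(docs)
print(vocab)
```

Here the vocabulary shrinks from nine distinct terms to four: "the" is dropped because it appears in every document, and terms seen only once are dropped as too rare.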

Classifiers trained on datasets with increasing levels of class imbalance and evaluated under the same conditions often exhibit an artificially inflated classification accuracy, which can be misleading [108]. The high complexity of text data raises computational demands and makes it difficult to differentiate relevant aspects. Feature space, like the physical universe, is very sparsely populated [108]. Sparse data points in a large feature space can make generalization difficult and increase training time. Simple representations, such as bag-of-words, may fail to express linguistic nuances, especially when dealing with polysemy and synonyms. A higher-dimensional feature space is required to cope with this more complex situation [109]. Feature engineering is critical to make text data more manageable and understandable. Methods like term frequency–inverse document frequency (TF-IDF) and n-grams help models identify important terms and phrase structures. Word embeddings, including Word2Vec, GloVe, and fastText, provide compact, dense representations that enhance generalization across related concepts. More advanced models, such as BERT and GPT, go further by generating contextualized representations that capture the meanings of words based on their surrounding context [110].
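As a sketch of how TF-IDF weights terms, the function below uses one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency). Other variants add smoothing or normalization. The function name and toy corpus are illustrative, and documents are assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.
    tf = count / document length, idf = log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return vectors


docs = ["the bank raises funds", "the river bank floods", "the team wins"]
vectors = tf_idf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

A term that occurs in every document ("the") receives weight zero, whereas a term unique to one document ("funds") is weighted more heavily than one shared by two documents ("bank").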

Dimensionality reduction techniques such as principal component analysis (PCA), singular value decomposition (SVD), and autoencoders condense the feature space by preserving only the most significant features. This not only enhances model interpretability but also reduces training time, making the models more efficient. Modern embedding models such as BERT and GPT improve TC by incorporating contextual nuances, increasing model accuracy on complex language. While these developments improve TC, they also raise interpretability concerns. Deep learning models and embeddings are frequently viewed as “black boxes”, which is especially troublesome in industries requiring explanation, such as healthcare or finance. Attention mechanisms and explainable AI (XAI) tools help to highlight significant elements while balancing feature complexity and interpretability, allowing practitioners to make informed decisions in complicated language processing tasks [111].

9.8. Ambiguity and Polysemy in Language

Ambiguity and polysemy pose substantial challenges in natural language processing (NLP), particularly in tasks such as text categorization. Ambiguity occurs when a term or phrase has several possible meanings, such as “bank” referring to a financial organization or a riverbank. Polysemy is a type of ambiguity in which a word has multiple related meanings, such as “run” for physical exercise or executing a program. These phenomena hamper model performance because they require context for accurate interpretation, which standard models struggle with. Ambiguity creates confusion in TC, where context is critical for accurate classification [112]. For example, a headline like “local bank raises funds” requires contextual knowledge to distinguish between financial and non-financial topics. Simple models frequently misclassify such cases, and even neural models such as transformers can fail when contextual cues are not apparent or require cultural knowledge, emphasizing the importance of advanced context handling strategies [113].

Polysemy is especially difficult because static word embeddings assign each word a single vector and therefore cannot represent multiple meanings across contexts. Words like “light” can relate to either brightness or weight, depending on the context. Contextual embeddings, such as those used in BERT and GPT, address this by dynamically adjusting representations based on surrounding words, although complex phrases and nuanced interpretations continue to pose issues. Multilingual settings complicate TC further because ambiguity and polysemy vary across languages. Some languages use morphology to resolve ambiguity, while others rely heavily on context, which complicates tasks like machine translation. To deal with these challenges, multilingual models such as mBERT are trained on a variety of datasets, although linguistic diversity still presents limits [114].

There are several ways to deal with ambiguity and polysemy. Domain-specific models improve context and reduce misclassification, while auxiliary tasks such as part-of-speech tagging help clarify meaning. Ensemble models, which incorporate predictions from many models, improve overall performance. Although effective, these techniques are computationally expensive, demonstrating that ambiguity and polysemy remain key issues in NLP [115].
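A simple way to combine predictions from several models is majority voting, sketched below. The three prediction lists are hypothetical stand-ins for real classifier outputs, and voting is only one of several possible ensemble strategies.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one list of labels per model, all of equal length.
    Ties go to the label that appears first among the tied ones."""
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three classifiers on three headlines containing "bank".
model_a = ["finance", "finance", "nature"]
model_b = ["finance", "nature", "nature"]
model_c = ["nature", "finance", "nature"]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model is wrong on a different headline in this toy case, so the vote recovers a consistent answer, at the cost of running three models instead of one.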

Advancements and Emerging Trends in Text Categorization (TC)

Recent advances in TC reflect a paradigm shift away from traditional machine learning methods and toward deep learning and hybrid methodologies. These advancements enable better feature extraction, contextual comprehension, and flexibility across languages and domains, broadening the scope of TC’s practical applications. This section explores deep learning approaches for text categorization, focusing on the application of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for various text classification tasks. It also highlights the transformative impact of transfer learning and pretrained language models, such as BERT and GPT, in advancing text categorization with contextual understanding and reduced training requirements.

9.9. Deep Learning for Text Categorization

With its capacity to identify complex patterns in high-dimensional data, deep learning has transformed text classification by allowing models to learn directly from raw text with minimal feature engineering. Deep learning algorithms, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have exhibited significant promise in text classification tasks, excelling at capturing local and sequential relationships.

9.10. CNNs and RNNs for TC Tasks

CNNs and RNNs are two of the most popular TC architectures due to their distinct ability to process and comprehend textual data. CNNs, which have typically been employed in image processing, have been adapted for text classification by applying convolutional filters on word embeddings or n-gram representations. This technique recognizes local word patterns and is especially beneficial for short text categorization, such as sentence-level sentiment analysis [116]. CNNs’ hierarchical feature extraction technique finds relevant phrases and concepts, making them ideal for context-dependent document classification tasks [117].
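The core operation of a text CNN, sliding a filter over windows of consecutive word embeddings and keeping the strongest response, can be sketched in plain Python. The two-dimensional embeddings and the filter weights below are made-up toy values; a real model would learn many filters of several widths.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Apply one convolutional filter over word windows, then ReLU and max-over-time pooling.
    embeddings: list of word vectors; kernel: one weight vector per word in the window."""
    width = len(kernel)
    if len(embeddings) < width:
        raise ValueError("sentence is shorter than the filter width")
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias + sum(
            w * x
            for k_row, vec in zip(kernel, window)
            for w, x in zip(k_row, vec)
        )
        activations.append(max(0.0, total))  # ReLU
    return max(activations), activations


# Toy embeddings for a four-word sentence (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 0.5], [0.0, 1.0], [0.2, 0.2]]
# A filter spanning two consecutive words (a bigram detector).
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature, activations = conv1d_max_pool(sentence, bigram_filter)
print(activations, feature)
```

Max pooling keeps only the strongest window response, so the resulting feature indicates whether the local pattern occurs anywhere in the sentence, regardless of position.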

“RNNs, notably Long Short-Term Memory (LSTM) networks, have also proven useful for TC because of their sequential character, allowing them to effectively model dependencies across phrases and paragraphs” [118]. Recurrent neural networks (RNNs) are a type of neural network architecture mainly used to detect patterns in sequential data [119]. The sequential learning capabilities of these models are especially useful in TC tasks that involve long documents with complicated language structures.
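The sequential character of an RNN comes from feeding each hidden state back into the next step. The following sketch implements a plain (Elman-style) recurrent layer rather than an LSTM, with toy weights chosen by hand; all names and values are illustrative assumptions.

```python
import math


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Plain recurrent layer: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    Returns the hidden state after every time step."""
    hidden = [0.0] * len(b_h)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_xh[j], x))
                + sum(w * hi for w, hi in zip(w_hh[j], hidden))
                + b_h[j]
            )
            for j in range(len(b_h))
        ]
        states.append(hidden)
    return states


# One-dimensional toy sequence: a signal at step 1, then silence.
inputs = [[1.0], [0.0], [0.0]]
states = rnn_forward(inputs, w_xh=[[1.0]], w_hh=[[0.5]], b_h=[0.0])
print([round(h[0], 3) for h in states])
```

The hidden state stays positive after the input has gone silent, showing how earlier tokens continue to influence later steps. The influence decays with distance, which is one motivation for gated variants such as LSTM.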

9.11. Transfer Learning and Pretrained Language Models

Transfer learning, particularly through pretrained language models, represents a significant advancement in text classification (TC). By leveraging knowledge from vast and diverse text corpora, it reduces the reliance on extensive labeled datasets, thereby enhancing the accessibility of text classification for low-resource languages and niche domains.

9.11.1. Use of BERT, GPT, and Similar Models

Pretrained language models like Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT), and related architectures have revolutionized text classification (TC). These models are pretrained on extensive corpora and can be fine-tuned with minimal additional training for specific tasks, setting new benchmarks in performance and efficiency. “BERT, for example, uses a bidirectional attention mechanism to record the context of words from both left and right contexts, resulting in more nuanced understanding in TC applications” [120]. BERT’s deep bidirectional methodology makes it particularly successful for context-dependent tasks like sentiment analysis and topic classification.

GPT, on the other hand, employs a unidirectional transformer architecture, excelling at producing coherent, contextually relevant text and performing well on tasks requiring text generation or completion [121]. For TC, GPT and its successors, such as GPT-3, have demonstrated exceptional performance in few-shot and zero-shot classification scenarios, decreasing reliance on labeled data and facilitating fast knowledge transfer between languages and domains [122].
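The difference between bidirectional and unidirectional attention can be made concrete with the attention mask each style implies. This is a schematic sketch only; the function name is hypothetical, and real implementations apply such masks to attention scores inside the transformer.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(4, causal=False)  # BERT-style: full left and right context
unidirectional = attention_mask(4, causal=True)  # GPT-style: left context only

for row in unidirectional:
    print("".join("x" if allowed else "." for allowed in row))
```

In the unidirectional mask, the second token can attend only to the first two positions, whereas every token in the bidirectional mask can attend to the whole sequence.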

The introduction of these models significantly improved TC capabilities, allowing classifiers to function with minimal task-specific input while maintaining high levels of accuracy. Their efficacy across a variety of TC applications demonstrates transfer learning’s promise for dealing with complex and evolving text collections.

9.11.2. Hybrid Approaches Combining Knowledge Engineering and ML

Hybrid approaches that integrate knowledge engineering with machine learning are gaining traction, effectively bridging the gap between rule-based systems and data-driven methods. A SWOT analysis of the ten most frequently cited algorithms from a curated collection of peer-reviewed studies and research publications reveals the strengths and weaknesses of traditional algorithms while uncovering the opportunities and challenges that hybrid methods aim to address [123]. These methods incorporate human-defined rules and domain expertise into machine learning models, enhancing the interpretability and robustness of text classification (TC) systems.

Hybrid designs have also been explored in other fields. In recent years, hybrid physics–ML models have been developed that extend beyond residual modeling; a simple way to integrate physics-based and ML models is to use the output of a physics-based model as input for an ML algorithm [124]. Within hybrid TC systems, knowledge engineering is similarly applied to create initial feature sets or rules that feed into machine learning algorithms. For instance, domain-specific ontologies or taxonomies can guide feature selection, enabling the model to capture critical semantic details relevant to the categorization task. This approach is particularly effective in specialized fields such as healthcare or legal document categorization, where domain expertise is crucial for achieving accurate classification [35].
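One way to feed engineered knowledge into a learned classifier is to add concept-level features derived from a domain lexicon to the usual word features. The sketch below assumes a tiny hand-written lexicon (`DOMAIN_LEXICON`) standing in for a real ontology; every name and term in it is hypothetical.

```python
# Hypothetical mini-ontology mapping concepts to indicative terms.
DOMAIN_LEXICON = {
    "medical": {"diagnosis", "patient", "dosage"},
    "legal": {"plaintiff", "statute", "contract"},
}


def knowledge_features(text, lexicon=DOMAIN_LEXICON):
    """Rule-based features: number of distinct lexicon terms matched per concept."""
    tokens = set(text.lower().split())
    return {f"concept={concept}": len(tokens & terms) for concept, terms in lexicon.items()}


def hybrid_features(text):
    """Bag-of-words counts extended with the knowledge-engineered concept features."""
    tokens = text.lower().split()
    features = {f"word={t}": tokens.count(t) for t in set(tokens)}
    features.update(knowledge_features(text))
    return features


features = hybrid_features("The patient received a revised dosage")
print(features)
```

The resulting dictionary could be passed to any standard classifier. The concept features remain human-readable, which is part of the interpretability benefit attributed to hybrid systems.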

10. Future Directions and Research Opportunities

This section describes both broad future directions and specific research opportunities in text categorization (TC), focusing on developing trends and challenges in practical applications.

10.1. Multilanguage and Cross-Cultural Text Classification

This subsection discusses the challenges and advancements in creating inclusive TC systems that address linguistic and cultural diversity.

10.1.1. Importance of Cross-Language Communication

In today’s interconnected global landscape, seamless cross-language communication is essential. As language diversity persists as a barrier, domains like multilingual translation and text summarization are reaching a critical juncture, requiring innovative automated solutions [125]. Text classification models, which often rely on large-scale labeled datasets, are typically tailored for specific languages and cultural contexts. This limitation underscores the growing demand for systems capable of addressing linguistic and cultural diversity in an increasingly interconnected world [126].

10.1.2. Advancements in Multilingual NLP

New multilingual datasets featuring conversations in Chinese, English, Korean, and Japanese provide a robust foundation for developing powerful conversational AI systems [126]. Pretrained models like BERT have expanded their capabilities to include multilingual versions such as mBERT and XLM-R. These models enable simultaneous processing of diverse linguistic inputs, enhancing cross-language text classification [127].

10.1.3. Cross-Lingual Transfer Learning

Cross-lingual transfer learning, facilitated by both human and machine translation, plays a pivotal role in multilingual text classification. Many multilingual datasets are generated through professional translations, while machine translation is frequently employed to translate training or test sets. Despite these advancements, challenges remain, such as the lack of standardized multilingual datasets annotated under consistent guidelines, particularly for intent detection and slot filling tasks [128,129].

10.1.4. Cultural Sensitivity in Text Classification

Text classification systems must navigate cultural nuances, including idiomatic expressions, societal norms, and sentiment variations across regions. For example, positive or neutral sentiment expressions can differ significantly between cultures, affecting sentiment analysis accuracy. Translators must ensure cultural appropriateness, preserving the natural tone and relevance for the target audience.

10.1.5. Future Research Directions

Universal Multilingual Models

Developing generalized models capable of learning across multiple languages with minimal reliance on labeled data is a critical research direction. Universal multilingual models such as XLM-R and mBERT have laid the groundwork, but further advancements are needed to enhance their adaptability to low-resource languages and diverse linguistic contexts. By leveraging transfer learning, cross-lingual embeddings, and domain adaptation, these models can facilitate effective communication and analysis across linguistic and cultural barriers.

Low-Resource Languages

Addressing data scarcity in low-resource languages remains a significant challenge. Techniques such as unsupervised learning, self-supervised approaches, and domain-specific transfer learning can mitigate these limitations. For instance, multilingual pretrained models can be fine-tuned for specific low-resource languages, enabling their inclusion in broader applications and ensuring global inclusivity. Integrating machine translation and text classification tasks could also enhance the usability of these models in multilingual environments.

Enhanced Language Identification

Future text categorization systems must incorporate advanced language identification techniques to process user-generated content that often includes multiple languages. Methods such as combining deep learning with linguistic rules can improve accuracy in detecting and processing code-switching and mixed-language texts. This capability is essential for applications in social media monitoring, global marketing, and multilingual customer support, where accurate language identification is critical [130].

Cultural Awareness in Models

Embedding cultural sensitivity into text categorization models is vital for improving their classification accuracy and relevance in diverse contexts. Cultural nuances, idiomatic expressions, and societal norms influence language usage and sentiment expression, which models must understand to perform effectively. Incorporating cultural awareness into training data and leveraging cross-cultural embeddings can enhance the adaptability and inclusivity of these systems.

Integration with Real-Time and Multimodal Systems

The integration of text categorization with real-time processing and multimodal systems is another promising research avenue. Real-time categorization systems must handle dynamic data streams with minimal latency while maintaining accuracy. Combining text with visual and audio inputs, such as in social media content analysis, could provide richer contextual understanding and enhance classification outcomes. Edge computing and incremental learning techniques can support this shift toward dynamic, real-time systems.

Ethical AI, Transparency, and Bias Mitigation

Ensuring ethical AI methods and transparency is especially important in sensitive applications such as recruitment, healthcare, legal analytics, and public safety. Addressing biases in training data and algorithms necessitates frameworks for bias identification, mitigation, and explainable AI (XAI) strategies that promote trust and responsibility. Incorporating ethical considerations into model design ensures fair, transparent, and impartial results across varied demographic and cultural contexts, boosting user trust in real-time text classification systems.

Hybrid and Explainable Models

Combining machine learning with rule-based systems offers a promising avenue for creating interpretable and robust text categorization models. Hybrid models can balance precision and transparency, making them more suitable for high-stakes applications. Explainable AI approaches will play a crucial role in enabling users to understand and trust the decision-making processes of these models.

By addressing these research directions, the next generation of text categorization systems can achieve greater inclusivity, adaptability, and ethical integrity. These advancements will not only refine technical performance but also ensure that text categorization technologies remain relevant and impactful in an increasingly interconnected and data-driven world.

10.2. Addressing Emerging Challenges in Multilingual Text Classification

To enhance the inclusivity and adaptability of text classification (TC) systems, addressing emerging challenges in multilingual and cross-cultural contexts has become a pressing need. While advancements in pretrained multilingual models and cross-lingual transfer learning have set a strong foundation, several critical areas demand focused research.

10.2.1. Low-Resource Language Challenges

Despite progress, low-resource languages continue to pose significant challenges. The scarcity of labeled datasets and the diversity in linguistic structures impede the development of robust TC systems for these languages. Techniques such as unsupervised learning, few-shot learning, and synthetic data generation can help bridge this gap. For example, using generative AI models to create synthetic training data for low-resource languages could expand their application scope in multilingual environments.

10.2.2. Adaptive Multilingual Systems

Dynamic multilingual systems that can adapt to real-time user needs and cultural contexts represent a promising direction. Innovations in adaptive embeddings, context-aware processing, and reinforcement learning for linguistic and cultural adaptation are critical to enabling seamless multilingual applications. These systems must also address code switching, mixed-language text processing, and evolving regional dialects to ensure relevance and accuracy.

By tackling these challenges, text classification systems can better support global communication needs, fostering inclusivity and equity in an increasingly interconnected world.

10.3. Real-Time Text Categorization Applications

This subsection explores the demands and opportunities of deploying TC systems in real-time environments where immediate decision making is critical.

10.3.1. The Need for Real-Time Classification

Real-time text categorization enables immediate processing and classification of newly generated content, bypassing the need for batch operations. This capability is critical for applications such as social media monitoring, content filtering, and customer support, where real-time decision making is essential [13].

10.3.2. Scalability and Speed in Real-Time Systems

The high volume and rapid generation of content on social media and news platforms demand systems that are both fast and scalable. For instance, integrating report texts with tweets containing relevant links has been shown to improve classification outcomes in real-time environments [131].

10.3.3. Incremental Learning for Dynamic Content

Real-time systems thrive in dynamic environments by employing incremental learning techniques. These approaches allow models to continuously adapt to new data, enhancing their robustness in ever-changing contexts. Lifelong learning frameworks provide methods for task-incremental, domain-incremental, and class-incremental learning, bridging the gap between natural and artificial intelligence [132].
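Incremental learning can be illustrated with a multinomial Naive Bayes classifier whose counts are updated one document at a time, so no retraining from scratch is needed. This is a minimal sketch under simplifying assumptions (whitespace tokenization, Laplace smoothing); the class name and example messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes that learns from one labeled document at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts = Counter()           # documents seen per class
        self.word_counts = defaultdict(Counter) # word counts per class
        self.vocab = set()

    def partial_fit(self, text, label):
        """Update the counts with a single new document."""
        tokens = text.lower().split()
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the class with the highest smoothed log-probability."""
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = IncrementalNaiveBayes()
clf.partial_fit("win cash prize now", "spam")
clf.partial_fit("meeting agenda attached", "ham")
before = clf.predict("crypto giveaway")           # both words are still unseen
clf.partial_fit("crypto giveaway click now", "spam")
after = clf.predict("crypto giveaway")            # the model has adapted
print(before, "->", after)
```

A single update is enough for the classifier to recognize a newly emerging spam topic, which is the kind of adaptation to changing content that incremental learning aims for.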

10.3.4. Latency-Aware Optimization

Reducing latency while preserving accuracy is crucial for real-time systems, particularly those deployed on resource-constrained devices. Techniques such as knowledge distillation, model pruning, edge computing, and optimized inference algorithms minimize computational demands while maintaining high performance. These strategies enable efficient and energy-conscious processing, making them essential for latency-sensitive applications like real-time sentiment analysis and content moderation [48].
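Of the techniques listed above, model pruning is the simplest to sketch. The following example shows unstructured magnitude pruning, which zeroes the weights with the smallest absolute values. The weight values and the 50% sparsity level are arbitrary illustrations, and practical pruning is usually followed by fine-tuning.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # hypothetical layer weights
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Zeroed weights can be skipped or stored sparsely at inference time, which is where the savings in computation and memory come from.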

10.3.5. Future Research Opportunities

High-Throughput Systems

Developing models capable of processing large-scale, real-time data streams with minimal latency remains a top priority. Future systems must leverage advanced technologies such as edge computing, model distillation, and parallel processing to handle massive volumes of data without sacrificing accuracy. High-throughput systems can play a critical role in applications like live news categorization, stock market analysis, and emergency response, where rapid decision making is essential.

Dynamic Adaptation

Real-time content is highly dynamic, with patterns and trends shifting quickly. To maintain relevance and accuracy, it is essential to enhance models to adapt to these changes. Incremental learning techniques, which enable models to update and evolve without requiring complete retraining, offer a particularly effective solution. These methods can be combined with continual learning frameworks to create systems that seamlessly adjust to new topics, terms, and contexts over time.

Applications in Diverse Domains

Real-time text categorization offers significant potential across various fields:

Social Media Analytics: Identifying trends, sentiment, and emerging topics in real time.
Spam Detection: Filtering spam messages or malicious content as it is generated.
Fraud Prevention: Monitoring financial transactions or communications for suspicious patterns.
Customer Support Chatbots: Providing instant, context-aware responses to user queries.

10.3.6. Real-Time Multimodal Integration

Combining text with other data modalities, such as images, videos, and audio, presents an exciting research direction. For instance, analyzing text alongside accompanying visuals in social media posts could provide richer insights into user intent and sentiment. Multimodal approaches will be critical for applications like live event monitoring and personalized content delivery, where a holistic understanding of data is necessary.

10.3.7. Scalability for Global Applications

With the growing global nature of data, scalable systems capable of processing multilingual and culturally diverse content in real time are needed. Advances in cross-lingual embeddings, transfer learning, and domain adaptation will enable models to handle diverse data streams efficiently. This scalability is particularly important for global platforms that deal with multilingual user bases, such as international social media networks and e-commerce platforms.

10.3.8. Context-Aware Personalization

Future systems should aim to provide personalized categorizations by incorporating user preferences, location, and historical interactions. Context-aware models can improve the relevance and utility of real-time classifications in applications like targeted marketing, personalized news feeds, and adaptive recommendation systems.

10.4. Integration with Other NLP Tasks

10.4.1. Expanding the Scope of NLP Integration

Integrating various NLP tasks such as named entity recognition (NER), parsing, sentiment analysis, and information extraction into text classification can significantly improve system performance. These tasks enable models to derive deeper insights from textual data, supporting more complex applications.

10.4.2. Named Entity Recognition (NER)

NER identifies entities like names, locations, and organizations within the text, enhancing classification accuracy for domain-specific tasks such as medical or legal document analysis. This task is critical for structured data extraction in applications like information retrieval and question answering [133,134].

10.4.3. Parsing Techniques

Parsing systems analyze sentence structure and the relationships between words, helping models capture both syntactic and semantic nuances. These insights enable more accurate distinctions between text types, such as formal articles versus informal blog posts, by analyzing the linguistic and structural features characteristic of each type [134]. For example, formal articles often have higher lexical richness, precise terminology, and well-structured arguments, which are frequently supported by citations and an objective tone. In contrast, informal blog posts use conversational language, personal anecdotes, and emotive expressions to engage readers. Detecting and measuring these traits improves text classification and supports context-specific applications in academic research, content curation, and targeted marketing.
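
Some of the traits described above can be approximated with simple surface statistics. The sketch below is illustrative only: it uses regular expressions rather than a syntactic parser, and the function `style_features`, the `FIRST_PERSON` word list, and the example sentences are assumptions introduced here, not features reported in [134].

```python
import re

# Hypothetical word list used as a rough proxy for conversational, personal tone.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def style_features(text):
    """Return a few surface-level stylistic features of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        # Lexical richness: distinct tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / n,
        # Share of tokens that are first-person pronouns.
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        # Emotive punctuation.
        "exclamations": text.count("!"),
        # Bracketed numeric citations such as [12] or [3, 4].
        "citations": len(re.findall(r"\[\d+(?:,\s*\d+)*\]", text)),
    }


formal = style_features(
    "Parsing systems analyze sentence structure and relationships between words [134]."
)
informal = style_features(
    "I loved it! We tried it and I think my friends loved it too!"
)
print(formal)
print(informal)
```

Feature dictionaries of this kind could be passed to any standard classifier. A parser-based pipeline would add syntactic features, such as dependency depth or clause counts, that this sketch does not compute.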

10.4.4. Information Extraction (IE)

IE techniques automatically identify structured data within unstructured text. This functionality is particularly useful in applications like legal document analysis and automated data entry, where structured outputs are crucial [135].

10.4.5. Multitask Learning Frameworks

Multitask learning involves training models to handle several NLP tasks simultaneously, leading to richer feature representations and improved overall performance. For example, integrating text summarization and sentiment analysis within a single model can yield more nuanced outcomes [59,136].
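
One common way to formalize this idea, given here as an illustrative formulation rather than the specific objective used in [59,136], is a weighted sum of per-task losses computed over a shared encoder. The symbols are assumptions introduced for this sketch: $\theta_s$ denotes the shared parameters, $\theta_t$ the parameters of the head for task $t$, $\mathcal{L}_t$ the loss of task $t$, and $\lambda_t \geq 0$ a task weight.

```latex
\mathcal{L}(\theta_s, \theta_1, \dots, \theta_T)
  = \sum_{t=1}^{T} \lambda_t \, \mathcal{L}_t(\theta_s, \theta_t)
```

Because every task loss depends on $\theta_s$, gradients from all tasks update the shared representation, which is the mechanism behind the richer features described above.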

10.5. Advancing Multimodal Text Classification

10.5.1. Combining Modalities for Comprehensive Analysis

Multimodal classification combines textual data with other data types, such as images, videos, or audio, providing a holistic understanding of user-generated content. For example, social media platforms can analyze both text and accompanying images to classify posts more effectively.

10.5.2. Practical Applications

E-Commerce: Platforms can integrate sentiment analysis and NER to classify product reviews, extract brand mentions, and monitor customer feedback in real time.
Social Media: By combining text-based sentiment analysis with image-based emotion detection, platforms can enhance their content moderation and analytics capabilities.

10.5.3. Future Research Directions in Multimodal Text Classification

The addition of multiple modalities in text classification opens up new opportunities for advancing the field. Beyond the current applications in e-commerce and social media, innovative research can explore the following directions:

10.5.4. Dynamic Multimodal Fusion Techniques

Future research should focus on developing advanced techniques for dynamically fusing multimodal data. This includes creating adaptive models that can weigh the importance of text, images, videos, and audio based on the context of the task. For instance, a news categorization system might prioritize textual content for breaking news and image content for photojournalism.
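
A minimal sketch of context-dependent weighting is given below, assuming a late-fusion design in which each modality already produces class probabilities. The function `fuse`, the relevance scores, and the probabilities are hypothetical. In a real system the relevance scores would be produced by a learned gating network rather than set by hand.

```python
import math


def softmax(scores):
    """Turn arbitrary relevance scores into weights that sum to 1."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


def fuse(modality_probs, relevance):
    """Late fusion: weight each modality's class probabilities by its gate."""
    gates = softmax(relevance)
    classes = next(iter(modality_probs.values())).keys()
    fused = {
        c: sum(gates[m] * modality_probs[m][c] for m in modality_probs)
        for c in classes
    }
    return fused, gates


# Hypothetical per-modality predictions for one news item.
probs = {
    "text": {"politics": 0.8, "sports": 0.2},
    "image": {"politics": 0.3, "sports": 0.7},
}

# Breaking-news context: the gate favors text.
breaking, _ = fuse(probs, {"text": 2.0, "image": 0.0})
# Photojournalism context: the gate favors the image.
photo, _ = fuse(probs, {"text": 0.0, "image": 2.0})
print(breaking, photo)
```

The same inputs yield different fused predictions depending on the context-specific gates, which is the behavior the news categorization example describes.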

10.5.5. Temporal Multimodal Analysis

Investigating the temporal aspects of multimodal data, such as analyzing how user sentiment evolves over time across different modalities, could be a valuable direction. This is particularly relevant for applications like campaign monitoring, where text, images, and videos are generated sequentially and provide evolving narratives.

10.5.6. Real-Time Multimodal Interaction

Building systems capable of real-time multimodal interaction presents an exciting challenge. For instance, integrating live video feeds with chat-based textual input can enhance virtual events, online education, and telemedicine. These systems would need to process and classify data across modalities simultaneously, ensuring high responsiveness and accuracy.

10.5.7. Cross-Modal Transfer Learning

Future work could explore cross-modal transfer learning, where knowledge from one modality (e.g., textual embeddings) is transferred to another (e.g., image features) to improve performance. This approach can be particularly effective in domains where one modality has abundant labeled data while another is scarce.

10.5.8. Domain-Specific Multimodal Solutions

Developing domain-specific multimodal frameworks tailored to fields like healthcare, finance, or legal analysis can drive significant progress. For instance:

Healthcare: Analyzing patient notes alongside medical images for enhanced diagnostic accuracy.
Finance: Integrating financial reports (text) with market trend graphs (visuals) to improve investment decision making.
Legal Analysis: Combining contract text with associated diagrams or annotations to classify clauses efficiently.

10.5.9. Augmented Reality (AR) and Virtual Reality (VR) Integration

As AR and VR applications grow, research could focus on integrating multimodal text classification into these environments. For example, AR systems could analyze spoken words, gestures, and textual annotations in real time to assist users in educational or professional contexts.

10.5.10. Emotion and Context Detection

Future systems could explore more nuanced emotion and context detection by combining textual sentiment analysis with facial expressions, voice tones, and visual cues. This could significantly enhance applications in customer service, mental health analysis, and human–computer interaction.

10.5.11. Energy-Efficient Multimodal Models

Multimodal classification systems are computationally intensive. Research into energy-efficient architectures, such as low-power neural networks and efficient hardware accelerators, can make these systems more accessible for real-world deployment, especially on mobile and edge devices.

10.5.12. Interactive Multimodal Systems

Interactive systems that allow users to provide real-time feedback on classifications can improve model accuracy and adaptability. For instance, a system analyzing tweets and images could adjust its categorization based on user input, ensuring more accurate classifications.

10.5.13. Multimodal Anomaly Detection

Expanding research to include anomaly detection in multimodal data streams can enhance applications like fraud detection, cybersecurity, and disaster response. For example, detecting inconsistencies between textual content and visual evidence can flag potentially fraudulent activities.

By pursuing these directions, multimodal text classification can evolve into a more versatile, context-aware, and impactful tool, enabling transformative applications across industries and societal domains. Table 3 summarizes the future research directions in text categorization.

11. Conclusions

The field of text categorization (TC) has experienced significant evolution, becoming a foundational component in natural language processing (NLP) and machine learning (ML). Transitioning from manual classification to scalable, ML-driven methods has revolutionized the ability to process, organize, and analyze large-scale textual data across various domains. Advances such as supervised learning, feature engineering, and dimensionality reduction have greatly improved the accuracy and efficiency of TC systems, making them critical for applications like sentiment analysis, spam detection, and domain-specific categorization. Despite these achievements, challenges like overfitting, class imbalance, language complexity, and high computational demands remain, underscoring the need for innovations in model interpretability and robustness.

Deep learning techniques, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and pretrained models like BERT and GPT, have expanded the capabilities of TC by enabling advanced language understanding and contextual analysis. However, their dependence on large datasets and high computational power limits their practicality, especially for low-resource languages and real-time applications. Addressing these constraints requires a focus on developing efficient learning techniques, hybrid approaches, and explainable AI (XAI) solutions. Combining machine learning with knowledge engineering can result in interpretable and reliable models, while integrating TC with other NLP tasks, such as text summarization, named entity recognition (NER), and sentiment analysis, has the potential to create more intelligent and context-aware systems.

The future of TC lies in its ability to adapt to the demands of an increasingly interconnected and data-driven world. Multilingual and cross-cultural applications, real-time systems, and multimodal integration are poised to shape the next wave of advancements in the field. These developments will not only enhance the scalability and precision of TC systems but also democratize access to AI technologies, fostering inclusivity and global applicability. Furthermore, addressing ethical concerns, such as bias mitigation and transparency, will be critical to building trust and ensuring equitable outcomes in high-stakes applications like recruitment, healthcare, and legal analytics.

By refining technical performance and enhancing real-world relevance, text categorization systems are positioned to play a pivotal role in information retrieval, data mining, and decision making. The integration of advanced algorithms, ethical frameworks, and interdisciplinary approaches will drive innovation, enabling TC systems to overcome existing challenges while unlocking unprecedented opportunities across industries. As researchers and practitioners collaborate to push the boundaries of what is possible, the future of TC promises transformative impacts on how we process, understand, and derive value from textual data.

Establishing unified methodologies and benchmarks, such as cross-domain datasets and standardized evaluation metrics, could significantly improve comparability and reproducibility. Incorporating quantitative data in future studies, such as specific performance metrics, would further enhance the practical relevance of TC systems. For example, deep learning models like BERT have demonstrated over 97% accuracy in contextual classification tasks, while naive Bayes algorithms continue to offer reliable performance, with accuracies exceeding 95% in less complex domains. Explainable AI (XAI) can make TC systems more interpretable, which is critical for applications in sensitive fields such as healthcare and legal analytics. Techniques such as attention mechanisms, which visually highlight key decision-influencing features, can improve transparency and foster trust in these models. Additionally, domain-specific pretrained models have been shown to improve classification accuracy by 10–15%, particularly in specialized industries like medical diagnostics and legal text processing.

Future advancements in TC are likely to be shaped by multilingual and cross-cultural applications, real-time processing systems, and multimodal integration (e.g., combining text with visual or audio data). These advancements will not only enhance scalability and precision but also democratize access to AI technologies, fostering inclusivity across global applications. Ethical considerations, such as bias mitigation and fairness, must also remain a priority. For instance, algorithmic adjustments to balance class distributions have been shown to reduce bias by up to 20%, ensuring fairer outcomes in high-stakes domains like recruitment and healthcare.

By addressing these gaps and challenges, the integration of advanced algorithms, ethical frameworks, and interdisciplinary approaches will drive innovation in the field. This evolution positions TC systems as transformative tools, capable of unlocking unprecedented opportunities in how we process, understand, and derive value from textual data.

12. Case Studies

The rapid growth of digital data has transformed how organizations and researchers manage and extract value from unstructured text. Text categorization, powered by machine learning and artificial intelligence, has emerged as a cornerstone in enabling this transformation. It allows the automatic organization, analysis, and retrieval of textual information with unprecedented speed and accuracy. From detecting spam emails to classifying patient records in healthcare, the applications of text categorization are as diverse as they are impactful.

This section presents five real-world case studies that exemplify the practical implementation of text categorization across various domains. These examples are drawn from industries such as technology, academia, media, customer service, and healthcare, highlighting the versatility and adaptability of machine learning techniques. Each case study demonstrates how state-of-the-art algorithms, ranging from naive Bayes and k-nearest neighbors to cutting-edge transformer models like BERT, have revolutionized workflows, improved decision making, and delivered tangible benefits. By showcasing these applications, the case studies aim to bridge the gap between theoretical advancements and practical deployment. They serve as a testament to the potential of text categorization to address complex challenges, drive innovation, and unlock new possibilities in data-driven environments. Whether you are an academic researcher, an industry practitioner, or a technology enthusiast, these cases provide valuable insights into how text categorization is shaping the modern world.

12.1. Case Study 1: Spam Detection in Email Systems

A study conducted by Google (Google LLC, Mountain View, CA, USA) applied machine learning techniques to classify emails as spam or non-spam. Using naive Bayes and support vector machines (SVMs), the project achieved an accuracy of 95% on a dataset of 10,000 emails. Feature selection used term frequency–inverse document frequency (TF-IDF) weighting to improve the model’s ability to identify spam-related keywords. This application demonstrated the potential of text categorization to reduce manual effort and improve efficiency in email filtering systems.
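
The TF-IDF weighting mentioned above can be sketched as follows. The case study does not specify which TF-IDF variant was used, so this example assumes the basic form, relative term frequency multiplied by log(N/df). The function `tfidf` and the three toy emails are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights per document using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return vectors


emails = [
    "free prize claim your free prize",
    "your meeting is at noon",
    "claim your refund at the office",
]
vectors = tfidf(emails)

# Tokens of the first email, highest weight first.
ranked = sorted(vectors[0].items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In the first email, "free" and "prize" receive the highest weights because they are frequent there and absent elsewhere, whereas "your" occurs in every email and is weighted zero. This is how the weighting surfaces candidate spam-related keywords for feature selection.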

12.2. Case Study 2: Sentiment Analysis for Product Reviews

Researchers at Stanford University analyzed customer sentiment in product reviews using deep learning. They implemented a recurrent neural network (RNN) with long short-term memory (LSTM) units, achieving a sentiment classification accuracy of 88% on a dataset containing 50,000 reviews. This work highlighted the effectiveness of deep learning in understanding customer feedback and tailoring marketing strategies.

12.3. Case Study 3: Customer Support Chat Categorization

Zendesk, a leading customer service software company, implemented text categorization to automate the tagging and routing of customer support tickets. By using a combination of random forests and BERT-based transformer models, they achieved a classification accuracy of 93% across a dataset of 15,000 tickets. This system improved response times by 30% and enhanced customer satisfaction, demonstrating the value of automated categorization in customer service operations.

12.4. Case Study 4: News Article Categorization

Reuters News Agency developed a text categorization model to classify articles into topics such as “Politics”, “Sports”, and “Entertainment”. The approach utilized a combination of CNNs and word embeddings, achieving a classification accuracy of 90% on a dataset of 100,000 articles. This real-world application showcased how automated text categorization can streamline news organization and enhance reader engagement.

12.5. Case Study 5: Healthcare Document Analysis

A healthcare organization, Mayo Clinic, leveraged machine learning to categorize patient records based on diagnoses. By applying k-nearest neighbors (kNN) and random forests, the system achieved an F1-score of 87% on a dataset of 20,000 patient records. This initiative facilitated faster retrieval of patient information, improving decision making in clinical settings.

Author Contributions

Conceptualization, H.A. and L.M.; methodology, H.A.; software, H.A.; validation, H.A., L.M. and B.G.; formal analysis, H.A.; investigation, H.A.; resources, H.A.; data curation, H.A.; writing—original draft preparation, H.A.; writing—review and editing, H.A. and L.M.; visualization, H.A.; supervision, K.N.G.; project administration, K.N.G.; funding acquisition, K.A. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Comparison of hard categorization (binary decision) and ranking categorization (multiple categories ranked) Hard categorization [38].

Figure 2. Ranking categorization [38]. (a) represents hard categorization while (b) represents ranking categorization.

Figure 4. CNN Sequence to Classify Digits [60].

Figure 5. How the Vector Space Model Works [64].

Figure 6. How Bag-of-Words Model Works [66].

Table 1

Key Contributions in Text Categorization Research. Recent Studies on Text Classification from 2019 to 2024.

Publication Type	Title	Year	Authors	Objectives	Insights	Practical Implications
Journal Article	Research on Intelligent Natural Language Texts Classification	2022	[11]	- Summarize and compare text classification methods. - Explore development direction of text classification research.	The paper summarizes previous studies on text classification, highlighting the rapid development of machine learning technologies and the diversification of research methods. It compares classification methods based on technical routes, text vectorization, and classification information processing for further research insights.	- Intelligent classification enhances efficient use of natural language texts. - Provides references for further research in text classification methods.
Journal Article	The Research Trends of Text Classification Studies (2000–2020): A Bibliometric Analysis	2022	[12]	- Evaluate the state of the art of TC studies. - Identify publication trends and important contributors in TC research.	The study analyzes 3121 text classification publications from 2000 to 2020, highlighting trends, contributors, and disciplines. It reveals increased interest in advanced classification algorithms, performance evaluation methods, and practical applications, indicating a growing interdisciplinary focus in text classification research.	- Recognizes recent trends in text classification research. - Highlights importance of advanced algorithms and applications.
Journal Article	A survey on text classification and its applications	2020	[13]	- Overview of existing text classification technologies. - Propose research direction for text mining challenges.	Previous studies on text classification have proposed various feature selection methods and classification algorithms, addressing challenges such as scalability due to the massive increase in text data. These studies highlight the importance of effective information organization and management in diverse research fields.	- Important applications in real-world text classification. - Addresses challenges in text mining and scalability.
Journal Article	A Survey on Text Classification: From Traditional to Deep Learning	2022	[14]	- Review state-of-the-art approaches from 1961 to 2021. - Create a taxonomy for text classification methods.	The paper reviews state-of-the-art approaches in text classification from 1961 to 2021, highlighting traditional models and deep learning advancements. It discusses technical developments and benchmark datasets and provides a comprehensive comparison of various techniques and evaluation metrics used in previous studies.	- Summarizes key implications for text classification research. - Identifies future research directions and challenges.
Book Chapter	Case Studies of Several Popular Text Classification Methods	2023	[15]	- Evaluate automatic language processing techniques for text classification. - Analyze and compare the performance of various text classification algorithms.	The paper discusses various text classification methods, highlighting that deep learning models, particularly distributed word representations like Word2Vec and GloVe, outperform traditional methods such as bag-of-words (BOW). Contextual embeddings like BERT also show significant performance improvements.	- Improved text classification methods for massive data analysis. - Enhanced performance using advanced feature extraction techniques.
Journal Article	Text Classification Using Deep Learning Models: A Comparative Review	2023	[16]	- Analyze deep learning models for text classification tasks. - Address gaps, limitations, and future research directions in text classification.	The paper conducts a literature review on various deep learning models for text classification, analyzing their gaps and limitations. It highlights previous studies’ comparative results and discusses classification applications, guiding future research directions in this field.	- Guidance for future research in text classification. - Highlights challenges and potential directions in the field.
Journal Article	Survey on Text Classification	2020	[17]	- Classify documents into predefined classes effectively. - Compare various text representation schemes and classifiers.	Previous studies on text classification have utilized various techniques, including supervised learning with labeled training documents, naive Bayes, and decision tree algorithms. Challenges include the difficulty of creating labeled datasets and the limited applicability of individual classifiers across different domains.	- Detailed information on text classification concepts and algorithms. - Evaluation of algorithms using common performance metrics.
Journal Article	The Text Classification Method Based on BiLSTM and Multi-Scale CNN	2024	[18]	- Overview of deep learning in text classification. - Analyze research progress and technical approaches.	Previous studies on text classification have transitioned from traditional machine learning methods to deep learning models, including attention mechanisms and pretrained language models, highlighting significant progress and challenges in enhancing model performance and dataset quality across various domains.	- Overview of deep learning text classification methods. - Analysis of labeled datasets for research support.
Journal Article	Research on Text Classification Method Based on NLP	2023	[19]	- Describe text classification concepts and processes. - Explore deep learning models for text classification.	Previous studies on text classification have explored various methods, including LSTM-based multitask learning architectures, capsule networks, and hybrid models like RCNNs, demonstrating advancements in feature extraction and improved performance in tasks such as sentiment analysis and spam recognition.	- Text classification methods are important for effectively classifying text-based data. - New ideas such as word embedding models and pretraining models have made great progress in text classification.
Book Chapter	A Comparative Study on Various Text Classification Methods	2020	[20]	- Analyze methods for efficient text classification. - Examine featurization techniques and their performance.	The paper does not provide a review of previous studies on text classification. Instead, it focuses on analyzing various text classification methods and featurization techniques, such as bag-of-words, TF-IDF vectorization, and Word2Vec approaches.	- Analyzes efficient text classification methods for decision making. - Discusses various featurization techniques for improved performance.
Journal Article	Evaluating text classification: A benchmark study	2024	[21]	- Investigate necessity of complex models versus simple methods. - Assess performance across various classification tasks and datasets.	The paper highlights a gap in existing literature, noting that previous research primarily compares similar types of methods without a comprehensive benchmark. This study aims to provide an extensive evaluation across various tasks, datasets, and model architectures.	- Simple methods can outperform complex models in certain tasks. - Negative correlation between F1 performance and complexity for small datasets.
Proceedings Article	Comparative Performance of Machine Learning Methods for Text Classification	2020	[22]	- Compare performance of machine learning and deep learning algorithms. - Explore scalability with larger data instances.	Previous studies on text classification primarily tested machine learning and deep learning methods with relatively small-sized data instances. This paper builds on that by comparing these methods’ performance and scalability using a larger dataset of 6000 instances across six classes.	- Deep learning outperforms traditional methods in text classification. - Scalability of methods for larger data instances explored.
Journal Article	A Survey on Text Classification using Machine Learning Algorithms	2019	[23]	- Explore algorithms for automated text document classification. - Select best features and classification algorithms for accuracy.	Previous studies on text classification have explored various methodologies, including feature selection techniques, like document frequency thresholding and information gain, and classification algorithms such as K-nearest neighbors and support vector machines, highlighting the importance of efficient keyword prioritization for accurate categorization.	- Automated text classification improves efficiency in document handling. - Reduces reliance on expert classification for large text documents.
Dataset	Text Classification Data from 15 Drug Class Review SLR Studies	2023	[24]	- Automate citation classification in systematic reviews. - Reduce workload in systematic review preparation.	The paper references a study by Cohen et al. (2006) that focused on reducing workload in systematic review preparation through automated citation classification, providing a foundation for the datasets used in the current text classification research on drug class reviews.	- Automates citation classification in systematic reviews. - Reduces workload for researchers in drug class studies.
Proceedings Article	An Exploration of the Effectiveness of Machine Learning Algorithms for Text Classification	2023	[25]	- Explore effectiveness of machine learning algorithms for text classification. - Compare performance of various algorithms like SVM, KNN, CNN, RNN.	The paper does not provide specific details on previous studies in text classification. It focuses on evaluating and comparing the performance of various machine learning algorithms, such as decision trees, SVM, KNN, CNN, and RNN for text classification tasks.	- Machine learning improves text classification accuracy and efficiency. - Algorithms can handle complex and large datasets effectively.
Proceedings Article	A Comparative Text Classification Study with Deep Learning-Based Algorithms	2022	[26]	- Compare deep learning algorithms for text classification. - Optimize hyperparameters and evaluate word embeddings’ effectiveness.	The paper compares its results with previous studies in the literature, highlighting significant improvements in classification performance using deep learning algorithms and word embeddings. It specifically utilizes an open-source Turkish News benchmarking dataset for this comparative analysis.	- Improved text classification performance using deep learning algorithms. - Effective hyperparameter tuning enhances classification accuracy.
Proceedings Article	Classification Models of Text: A Comparative Study	2021	[27]	- Overview of classification process stages. - Survey and compare popular classification algorithms.	The paper does not provide specific details on previous studies in text classification. Instead, it focuses on the classification process, including preprocessing, feature engineering, dimension decomposition, model selection, and evaluation, while surveying and comparing popular classification algorithms.	- Text classification has implications in education, politics, and finance. - The paper provides a comparative study of popular classification algorithms.
Journal Article	Trends and patterns of text classification techniques: a systematic mapping study	2020	[28]	- Provide an overview of text classification research trends and gaps. - Analyze research patterns, problems, and problem-solving methods in text classification.	The paper systematically reviews ninety-six studies on text classification from 2006 to 2017, identifying nine main problems and analyzing research patterns, data sources, language choices, and applied techniques, highlighting significant trends and gaps in the field.	- Highlights trends and gaps in text classification research. - Identifies nine main problems in text classification area.
Journal Article	Research On Text Classification Based On Deep Neural Network	2022	[4]	- Design text representation and classification models using deep networks. - Improve text feature representation and classification accuracy.	The paper highlights that traditional text classification methods, such as the bag-of-words model and vector space model, face challenges like loss of context, high dimensionality, and sparsity, prompting a shift towards deep learning techniques for improved performance.	- Deep learning models improve text classification performance compared to traditional methods. - The BRCNN and ACNN models proposed in the paper show better text feature representation and classification accuracy.

Table 2

Keyword Statistics.

Assign. Mean	Corr. Mean	P	R	F
8.6	3.6	41.5	46.9	44.0
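
Assuming that P, R, and F in Table 2 denote precision, recall, and the balanced F-measure expressed as percentages (the table does not expand these abbreviations), the reported F value is consistent with the harmonic mean of P and R:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 41.5 \times 46.9}{41.5 + 46.9}
  = \frac{3892.7}{88.4}
  \approx 44.0
```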

Table 3

A summary of future research in text categorization.

Research Focus	Future Research Direction	Potential Applications
Universal Multilingual Models	Develop generalized models for multilingual text classification with minimal labeled data.	Cross-cultural communication, multilingual customer support, and global content moderation.
Low-Resource Languages	Use transfer learning, domain adaptation, and unsupervised methods to address data scarcity.	Language preservation, text analysis in underserved regions, and niche domain categorization.
Enhanced Language Identification	Improve techniques for detecting and processing multiple languages in text.	Multilingual user-generated content analysis and global social media monitoring.
Cultural Awareness in Models	Embed cultural sensitivity to improve classification relevance across diverse contexts.	Sentiment analysis, cross-border marketing, and international public opinion tracking.
High-Throughput Systems	Develop systems capable of processing large-scale, real-time data streams with minimal latency.	Live news categorization, stock market monitoring, and emergency response systems.
Dynamic Adaptation	Enhance models to adjust to shifting patterns and evolving content in real time.	Social media analytics, adaptive spam filtering, and customer sentiment tracking.
Multimodal Integration	Combine text with other modalities (images, videos, audio) for holistic content analysis.	Social media content moderation, e-commerce review analysis, and multimedia news classification.
Temporal Multimodal Analysis	Analyze how user sentiment or trends evolve over time using multiple data types.	Campaign monitoring, real-time sentiment tracking, and user behavior analysis.
Real-Time Systems	Optimize latency and computational efficiency for real-time applications.	Chatbots, fraud detection, and personalized content delivery.
Cross-Modal Transfer Learning	Enable knowledge transfer between text and other data modalities for enhanced classification.	Healthcare diagnostics, financial trend analysis, and multimedia content categorization.
Domain-Specific Frameworks	Design tailored models for specific industries like healthcare, finance, and legal analysis.	Medical text categorization, contract clause extraction, and investment report analysis.
AR/VR Integration	Integrate text categorization into augmented and virtual reality systems.	Interactive learning environments, immersive customer support, and AR-based real-time text translation.
Emotion and Context Detection	Combine multimodal inputs for nuanced emotion and context understanding.	Mental health monitoring, sentiment-based recommendations, and adaptive marketing strategies.
Interactive Multimodal Systems	Develop systems allowing real-time user feedback to refine classification accuracy.	Live content moderation, chatbot systems, and collaborative filtering in e-commerce.
Ethical Considerations and Bias Mitigation	Focus on identifying and mitigating biases in training data and algorithms.	Recruitment systems, content moderation for sensitive topics, and legal document categorization.
Explainable AI and Hybrid Models	Combine rule-based systems with ML for interpretability and transparency.	Regulatory compliance, healthcare decision support, and consumer trust building.
Energy-Efficient Architectures	Research architectures that optimize resource usage for text categorization.	Mobile applications, edge computing, and sustainable AI deployment in resource-constrained settings.
Anomaly Detection	Develop methods to detect inconsistencies across multimodal data streams.	Fraud detection, cybersecurity monitoring, and disaster response systems.
Real-Time Multilingual Systems	Extend real-time systems to handle multiple languages dynamically.	Global event monitoring, real-time multilingual chatbots, and international e-commerce platforms.

This table provides a synthesized overview of the key future research directions in text categorization, reflecting advancements in multilingual, multimodal, real-time, and ethical AI practices, along with their applications across various domains.

References

1. Joachims, T.; Sebastiani, F. Guest editors’ introduction to the special issue on automated text categorization. J. Intell. Inf. Syst.; 2002; 18, 103. [DOI: https://dx.doi.org/10.1023/A:1013652626023]

2. Knight, K. Mining online text. Commun. ACM; 1999; 42, pp. 58-61. [DOI: https://dx.doi.org/10.1145/319382.319394]

3. Pazienza, M.T. Information Extraction; Springer: Berlin/Heidelberg, Germany, 1999.

4. Sebastiani, F. Text categorization: Advances and challenges. Comput. Linguist.; 2024; 50, pp. 205-245. P.3

5. Yang, Y.; Joachims, T. Text categorization. Scholarpedia; 2008; 3, 4242. [DOI: https://dx.doi.org/10.4249/scholarpedia.4242]

6. Lewis, D.D.; Hayes, P.J. Special issue on text categorization. Inf. Retr. J.; 1994; 2, pp. 307-340.

7. Manning, C.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.

8. Paaß, G. Document classification, information retrieval, text and web mining. Handbook of Technical Communication; De Gruyter Mouton: Berlin/Heidelberg, Germany, 2012; Volume 8, 141.

9. Larabi-Marie-Sainte, S.; Bin Alamir, M.; Alameer, A. Arabic Text Clustering Using Self-Organizing Maps and Grey Wolf Optimization. Appl. Sci.; 2023; 13, 10168. [DOI: https://dx.doi.org/10.3390/app131810168]

10. Dhar, V. The evolution of text classification: Challenges and opportunities. AI Soc.; 2021; 36, pp. 123-135.

11. Chen, Y.; Zhang, X.-M. Research on Intelligent Natural Language Texts Classification. Int. J. Adv. Comput. Sci. Appl.; 2022; 13, [DOI: https://dx.doi.org/10.14569/ijacsa.2022.0130404]

12. Haoran, Z.; Lei, L. The Research Trends of Text Classification Studies (2000–2020): A Bibliometric Analysis. SAGE Open; 2022; 12, 21582440221089963. [DOI: https://dx.doi.org/10.1177/21582440221089963]

13. Zhou, X.; Gururajan, R.; Li, Y.; Venkataraman, R.; Tao, X.; Bargshady, G.; Barua, P.D.; Kondalsamy-Chennakesavan, S. A survey on text classification and its applications. Web Intell.; 2020; 18, pp. 205-216. [DOI: https://dx.doi.org/10.3233/WEB-200442]

14. Qian, L.; Hao, P.; Jianxin, L.; Cong-min, X.; Renyu, Y.; Lichao, S.; Philip, S.Y.; Lifang, H. A Survey on Text Classification: From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol.; 2022; 13, pp. 1-41. [DOI: https://dx.doi.org/10.1145/3495162]

15. Karim, A.; Hami, Y.; Loqman, C.; Boumhidi, J. Case Studies of Several Popular Text Classification Methods. International Conference on Digital Technologies and Applications; Springer Nature: Cham, Switzerland, 2023; pp. 552-560.

16. Zulqarnain, M.; Sheikh, R.; Hussain, S.; Sajid, M.; Abbas, S.N.; Majid, M.; Ullah, U. Text Classification Using Deep Learning Models: A Comparative Review. Cloud Comput. Data Sci.; 2024; 5, pp. 80-96.

17. Leena, B.; Satish, K.V. Survey on Text Classification. Int. J. Innov. Sci. Res. Technol.; 2020; 5, pp. 543-549. [DOI: https://dx.doi.org/10.38124/IJISRT20JUL380]

18. He, B.; Yang, Y.; Wang, L.; Zhou, J. The Text Classification Method Based on BiLSTM and Multi-Scale CNN. Comput. Life; 2024; 12, pp. 43-49. [DOI: https://dx.doi.org/10.54097/ypxxse31]

19. Mengnan, W. Research on Text Classification Method Based on NLP. Adv. Comput. Signals Syst.; 2022; 7, pp. 93-100. [DOI: https://dx.doi.org/10.23977/acss.2023.070213]

20. Samarth, K.; Bishnu, T.; Priyanka, D.; Asit, K.D. A Comparative Study on Various Text Classification Methods. Computational Intelligence in Pattern Recognition; Springer: Singapore, 2019.

21. Reusens, M.; Stevens, A.; Tonglet, J.; De Smedt, J.; Verbeke, W.; Vanden Broucke, S.; Baesens, B. Evaluating text classification: A benchmark study. Expert Syst. Appl.; 2024; 254, 124302. [DOI: https://dx.doi.org/10.1016/j.eswa.2024.124302]

22. Bello, A.M.; Rahat, I.; Anne, J.; Dianabasi, N. Comparative Performance of Machine Learning Methods for Text Classification. Proceedings of the 2020 International Conference on Computing and Information Technology; Tabuk, Saudi Arabia, 9–10 September 2020; [DOI: https://dx.doi.org/10.1109/ICCIT-144147971.2020.9213788]

23. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information; 2019; 10, 150. [DOI: https://dx.doi.org/10.3390/info10040150]

24. Ankita, A.; Aravindan, M.K.; Manish, S.; Sathya, S.; Devika, A.V.; Jagmeet, S. An Exploration of the Effectiveness of Machine Learning Algorithms for Text Classification. Proceedings of the 2023 IEEE International Conference on Paradigm Shift in Information Technologies with Innovative Applications in Global Scenario; Indore, India, 28–29 December 2023.

25. Köksal, Ö.; Akgül, Ö. A Comparative Text Classification Study with Deep Learning-Based Algorithms. Proceedings of the 2022 9th International Conference on Electrical and Electronics Engineering; Alanya, Turkey, 29–31 March 2022.

26. Tiffany, Z. Classification Models of Text: A Comparative Study. Proceedings of the 2021 IEEE 11th Annual Computing and Communication Workshop and Conference; Las Vegas, NV, USA, 27–30 January 2021.

27. Maw, M.; Vimala, B.; Omer, R.; Sri Devi, R. Trends and patterns of text classification techniques: A systematic mapping study. Malays. J. Comput. Sci.; 2020; 33, pp. 102-117. [DOI: https://dx.doi.org/10.22452/mjcs.vol33no2.2]

28. Dea, W.K. Research on Text Classification Based on Deep Neural Network. Int. J. Commun. Netw. Inf. Secur.; 2022; 14, pp. 100-113. [DOI: https://dx.doi.org/10.17762/ijcnis.v14i1s.5618]

29. O’Donovan, M.A.; McCallion, P.; McCarron, M.; Lynch, L.; Mannan, H.; Byrne, E. A narrative synthesis scoping review of life course domains within health service utilisation frameworks. HRB Open Res.; 2019; 2, 6. [DOI: https://dx.doi.org/10.12688/hrbopenres.12900.1]

30. Dawar, I.; Kumar, N.; Pathan, S.; Layek, S. Text Categorization using Supervised Machine Learning Techniques. Proceedings of the 2023 Sixth International Conference of Women in Data Science at Prince Sultan University; Riyadh, Saudi Arabia, 14–15 March 2023; [DOI: https://dx.doi.org/10.1109/WiDS-PSU57071.2023.00046]

31. Quazi, S.; Musa, S.M. Performing Text Classification and Categorization through Unsupervised Learning. Proceedings of the 2023 1st International Conference on Advanced Engineering and Technologies; Kediri, Indonesia, 14 October 2023; [DOI: https://dx.doi.org/10.1109/iconnic59854.2023.10505896]

32. Karathanasi, L.C.; Bazinas, C.; Iordanou, G.; Kaburlasos, V.G. A Study on Text Classification for Applications in Special Education. Proceedings of the 2021 International Conference on Software, Telecommunications and Computer Networks; Split, Croatia, 23–25 September 2021; [DOI: https://dx.doi.org/10.23919/SOFTCOM52868.2021.9559128]

33. Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. Rev.; 2019; 52, pp. 273-292. [DOI: https://dx.doi.org/10.1007/s10462-018-09677-1]

34. Ittoo, A.; van den Bosch, A. Text analytics in industry: Challenges, desiderata and trends. Comput. Ind.; 2016; 78, pp. 96-107. [DOI: https://dx.doi.org/10.1016/j.compind.2015.12.001]

35. Shen, D. Text Categorization. 2009; Available online: https://dl.acm.org/doi/abs/10.1145/1645953.1646192 (accessed on 10 November 2024).

36. Sajid, N.A.; Rahman, A.; Ahmad, M.; Musleh, D.; Basheer Ahmed, M.I.; Alassaf, R.; Chabani, S.; Ahmed, M.S.; Salam, A.A.; AlKhulaifi, D. Single vs. multi-label: The issues, challenges and insights of contemporary classification schemes. Appl. Sci.; 2023; 13, 6804. [DOI: https://dx.doi.org/10.3390/app13116804]

37. Chen, R.; Zhang, W.; Wang, X. Machine learning in tropical cyclone forecast modeling: A review. Atmosphere; 2020; 11, 676. [DOI: https://dx.doi.org/10.3390/atmos11070676]

38. Wang, Z.; Zhao, J.; Huang, H.; Wang, X. A review on the application of machine learning methods in tropical cyclone forecasting. Front. Earth Sci.; 2022; 10, 902596. [DOI: https://dx.doi.org/10.3389/feart.2022.902596]

39. Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A survey on text classification algorithms: From text to predictions. Information; 2022; 13, 83. [DOI: https://dx.doi.org/10.3390/info13020083]

40. Shortliffe, E.H.; Buchanan, B.G.; Feigenbaum, E.A. Knowledge engineering for medical decision making: A review of computer-based clinical decision aids. Proc. IEEE; 1979; 67, pp. 1207-1224. [DOI: https://dx.doi.org/10.1109/PROC.1979.11436]

41. Ali, M.; Ali, R.; Khan, W.A.; Han, S.C.; Bang, J.; Hur, T.; Kim, D.; Lee, S.; Kang, B.H. A data-driven knowledge acquisition system: An end-to-end knowledge engineering process for generating production rules. IEEE Access; 2018; 6, pp. 15587-15607. [DOI: https://dx.doi.org/10.1109/ACCESS.2018.2817022]

42. Gupta, D. Applied Analytics Through Case Studies Using SAS and R: Implementing Predictive Models and Machine Learning Techniques; Apress: New York, NY, USA, 2018.

43. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (CSUR); 2002; 34, pp. 1-47. [DOI: https://dx.doi.org/10.1145/505282.505283]

44. Fuhr, N.; Knorz, G. Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS). Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Cambridge, UK, 2–6 July 1984; pp. 391-408.

45. Borko, H.; Bernick, M. Automatic document classification. J. ACM (JACM); 1963; 10, pp. 151-162. [DOI: https://dx.doi.org/10.1145/321160.321165]

46. Larkey, L.S. A patent search and classification system. Proceedings of the fourth ACM Conference on Digital Libraries; Berkeley, CA, USA, 11–14 August 1999; pp. 179-187.

47. Hayes, P.J.; Weinstein, S.P. CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories. Proceedings of the IAAI; Washington, DC, USA, 1–3 May 1990; pp. 49-64.

48. Androutsopoulos, I.; Koutsias, J.; Chandrinos, K.V.; Spyropoulos, C.D. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Athens, Greece, 24–28 July 2000; pp. 160-167.

49. Drucker, H.; Wu, D.; Vapnik, V.N. Support vector machines for spam categorization. IEEE Trans. Neural Netw.; 1999; 10, pp. 1048-1054. [DOI: https://dx.doi.org/10.1109/72.788645] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/18252607]

50. Gale, W.A.; Church, K. A program for aligning sentences in bilingual corpora. Comput. Linguist.; 1993; 19, pp. 75-102.

51. Chakrabarti, S.; Dom, B.; Raghavan, P.; Rajagopalan, S.; Gibson, D.; Kleinberg, J. Automatic resource compilation by analyzing hyperlink structure and associated text. Comput. Netw. ISDN Syst.; 1998; 30, pp. 65-74. [DOI: https://dx.doi.org/10.1016/S0169-7552(98)00087-7]

52. Mohammad, S.M. Sentiment analysis: Detecting valence, emotions, and other affectual states from text. Emotion Measurement; Elsevier: Amsterdam, The Netherlands, 2016; pp. 201-237.

53. Yang, Y.; Liu, X. A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Berkeley, CA, USA, 15–19 August 1999; pp. 42-49.

54. Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res.; 2003; 3, pp. 1289-1305.

55. Aggarwal, C.C.; Zhai, C. An introduction to text mining. Mining Text Data; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1-10.

56. McCallum, A.; Nigam, K. A comparison of event models for naive Bayes text classification. Proceedings of the AAAI-98 Workshop on Learning for Text Categorization; Madison, WI, USA, 26–27 July 1998; Volume 752, pp. 41-48.

57. Luo, X. Efficient English text classification using selected machine learning techniques. Alex. Eng. J.; 2021; 60, pp. 3401-3409. [DOI: https://dx.doi.org/10.1016/j.aej.2021.02.009]

58. Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag.; 2018; 13, pp. 55-75. [DOI: https://dx.doi.org/10.1109/MCI.2018.2840738]

59. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res.; 2003; 3, pp. 1157-1182.

60. Mondal, S.; Barman, A.K.; Basumatary, S.; Barman, M.; Rai, C.; Nag, A. Cancer Text Article Categorization and Prediction Model Based on Machine Learning Approach. Proceedings of the 2023 IEEE 3rd Mysore Sub Section International Conference; Hassan, India, 1–2 December 2023.

61. Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way; Towards Data Science: San Francisco, CA, USA, 2018.

62. Ali, S.I.M.; Nihad, M.; Sharaf, H.M.; Farouk, H. Machine learning for text document classification-efficient classification approach. IAES Int. J. Artif. Intell.; 2024; 13, pp. 703-710. [DOI: https://dx.doi.org/10.11591/ijai.v13.i1.pp703-710]

63. Valluri, D.; Manne, S.; Tripuraneni, N. Custom Dataset Text Classification: An Ensemble Approach with Machine Learning and Deep Learning Models. Proceedings of the 2023 3rd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA); Bengaluru, India, 21–23 December 2023.

64. Manning, C.D. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.

65. Salton, G.; Wong, A.; Yang, C.-S. A vector space model for automatic indexing. Commun. ACM; 1975; 18, pp. 613-620. [DOI: https://dx.doi.org/10.1145/361219.361220]

66. Van Otten, N. Vector Space Model Made Simple with Examples & Tutorial in Python; Spot Intelligence: London, UK, 2023.

67. Mikolov, T. Efficient estimation of word representations in vector space. arXiv; 2013; arXiv: 1301.3781

68. DataScientyst. How to Create a Bag of Words in Pandas Python. Available online: https://datascientyst.com/create-a-bag-of-words-pandas-python/ (accessed on 24 November 2024).

69. Lovins, J.B. Development of a stemming algorithm. Mech. Transl. Comput. Linguist.; 1968; 11, pp. 22-31.

70. Ramos, J. Using tf-idf to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning; Los Angeles, CA, USA, 23–24 June 2003; pp. 29-48.

71. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag.; 1988; 24, pp. 513-523. [DOI: https://dx.doi.org/10.1016/0306-4573(88)90021-0]

72. Robertson, S.; Zaragoza, H. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr.; 2009; 3, pp. 333-389. [DOI: https://dx.doi.org/10.1561/1500000019]

73. Wang, T.; Cai, Y.; Leung, H.-f.; Cai, Z.; Min, H. Entropy-based term weighting schemes for text categorization in VSM. Proceedings of the 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI); Vietri sul Mare, Italy, 9–11 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 325-332.

74. Jones, K.S.; Walker, S.; Robertson, S.E. A probabilistic model of information retrieval: Development and comparative experiments: Part 2. Inf. Process. Manag.; 2000; 36, pp. 809-840. [DOI: https://dx.doi.org/10.1016/S0306-4573(00)00016-9]

75. Said, D.A. Dimensionality Reduction Techniques for Enhancing Automatic Text Categorization. Master's Thesis; Faculty of Engineering, Cairo University: Cairo, Egypt, 2007.

76. Murty, M.; Raghava, R. Kernel-based SVM. Support Vector Machines and Perceptrons: Learning, Optimization, Classification, and Application to Social Networks; Springer: Berlin/Heidelberg, Germany, 2016; pp. 57-67.

77. Li, B.; Yan, Q.; Xu, Z.; Wang, G. Weighted document frequency for feature selection in text classification. Proceedings of the 2015 International Conference on Asian Language Processing (IALP); Suzhou, China, 24–25 October 2015; pp. 132-135.

78. Christian, H.; Agus, M.P.; Suhartono, D. Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF). ComTech Comput. Math. Eng. Appl.; 2016; 7, pp. 285-294. [DOI: https://dx.doi.org/10.21512/comtech.v7i4.3746]

79. Peng, T.; Liu, L.; Zuo, W. PU text classification enhanced by term frequency–inverse document frequency-improved weighting. Concurr. Comput. Pract. Exp.; 2014; 26, pp. 728-741. [DOI: https://dx.doi.org/10.1002/cpe.3040]

80. Magnello, M.E. Karl Pearson, paper on the chi square goodness of fit test (1900). Landmark Writings in Western Mathematics 1640–1940; Elsevier: Amsterdam, The Netherlands, 2005; pp. 724-731.

81. Greenwood, P.E.; Nikulin, M.S. A Guide to Chi-Squared Testing; John Wiley & Sons: Hoboken, NJ, USA, 1996; Volume 280.

82. Chen, Y.-T.; Chen, M.C. Using chi-square statistics to measure similarities for text categorization. Expert Syst. Appl.; 2011; 38, pp. 3085-3090. [DOI: https://dx.doi.org/10.1016/j.eswa.2010.08.100]

83. Meesad, P.; Boonrawd, P.; Nuipian, V. A chi-square-test for word importance differentiation in text classification. Proceedings of the International Conference on Information and Electronics Engineering; Bangkok, Thailand, 28–29 May 2011; pp. 110-114.

84. Wang, G.; Lochovsky, F.H. Feature selection with conditional mutual information maximin in text categorization. Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management; Washington, DC, USA, 8–13 November 2004; pp. 342-349.

85. Lewis, D.D. An evaluation of phrasal and clustered representations on a text categorization task. Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Copenhagen, Denmark, 21–24 June 1992; pp. 37-50.

86. Dhar, A.; Mukherjee, H.; Dash, N.S.; Roy, K. Text categorization: Past and present. Artif. Intell. Rev.; 2021; 54, pp. 3007-3054. [DOI: https://dx.doi.org/10.1007/s10462-020-09919-1]

87. Lhazmir, S.; El Moudden, I.; Kobbane, A. Feature extraction based on principal component analysis for text categorization. Proceedings of the 2017 International Conference on Performance Evaluation and Modeling in Wired and Wireless Networks (PEMWN); Paris, France, 28–30 November 2017; pp. 1-6.

88. Bafna, P.; Pramod, D.; Vaidya, A. Document clustering: TF-IDF approach. Proceedings of the 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT); Chennai, India, 3–5 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 61-66.

89. Franke, T.M.; Ho, T.; Christie, C.A. The chi-square test: Often used and more often misinterpreted. Am. J. Eval.; 2012; 33, pp. 448-458. [DOI: https://dx.doi.org/10.1177/1098214011426594]

90. Tschannen, M.; Djolonga, J.; Rubenstein, P.K.; Gelly, S.; Lucic, M. On mutual information maximization for representation learning. arXiv; 2019; arXiv: 1907.13625

91. Cardoso-Cachopo, A.; Oliveira, A.L. An empirical comparison of text categorization methods. Proceedings of the International Symposium on String Processing and Information Retrieval; Manaus, Brazil, 8–10 October 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 183-196.

92. Yang, Y. An evaluation of statistical approaches to text categorization. Inf. Retr.; 1999; 1, pp. 69-90. [DOI: https://dx.doi.org/10.1023/A:1009982220290]

93. Baldi, P.; Brunak, S.; Chauvin, Y.; Andersen, C.A.; Nielsen, H. Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics; 2000; 16, pp. 412-424. [DOI: https://dx.doi.org/10.1093/bioinformatics/16.5.412] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/10871264]

94. Ruiz, M.E.; Srinivasan, P. Hierarchical text categorization using neural networks. Inf. Retr.; 2002; 5, pp. 87-118. [DOI: https://dx.doi.org/10.1023/A:1012782908347]

95. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. Using kNN model for automatic text categorization. Soft Comput.; 2006; 10, pp. 423-430. [DOI: https://dx.doi.org/10.1007/s00500-005-0503-y]

96. Lewis, D.D. Evaluating text categorization i. Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California; Morgan Kaufmann Publishers: Burlington, MA, USA, 1991.

97. Wang, B.; Li, C.; Pavlu, V.; Aslam, J. A pipeline for optimizing f1-measure in multi-label text classification. Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA); Orlando, FL, USA, 17–20 December 2018.

98. Hulth, A.; Megyesi, B. A study on automatically extracted keywords in text categorization. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics; Sydney, Australia, 17–21 July 2006; pp. 537-544.

99. Wong, T.-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit.; 2015; 48, pp. 2839-2846. [DOI: https://dx.doi.org/10.1016/j.patcog.2015.03.009]

100. Moss, H.B.; Leslie, D.S.; Rayson, P. Using JK fold cross validation to reduce variance when tuning NLP models. arXiv; 2018; arXiv: 1806.07139

101. Marcot, B.G.; Hanea, A.M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput. Stat.; 2021; 36, pp. 2009-2031. [DOI: https://dx.doi.org/10.1007/s00180-020-00999-9]

102. Bai, Y.; Chen, M.; Zhou, P.; Zhao, T.; Lee, J.; Kakade, S.; Wang, H.; Xiong, C. How important is the train-validation split in meta-learning? Proceedings of the International Conference on Machine Learning; Virtual, 18–24 July 2021; pp. 543-553.

103. Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine learning algorithm validation with a limited sample size. PLoS ONE; 2019; 14, e0224365. [DOI: https://dx.doi.org/10.1371/journal.pone.0224365] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31697686]

104. Zhang, H.; Zhang, L.; Jiang, Y. Overfitting and underfitting analysis for deep learning based end-to-end communication systems. Proceedings of the 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP); Xi’an, China, 23–25 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1-6.

105. Bu, C.; Zhang, Z. Research on overfitting problem and correction in machine learning. J. Phys. Conf. Ser.; 2020; 1693, 012100. [DOI: https://dx.doi.org/10.1088/1742-6596/1693/1/012100]

106. Dogra, V.; Verma, S.; Kavita; Chatterjee, P.; Shafi, J.; Choi, J.; Ijaz, M.F. A Complete Process of Text Classification System Using State-of-the-Art NLP Models. Comput. Intell. Neurosci.; 2022; 2022, 1883698. [DOI: https://dx.doi.org/10.1155/2022/1883698] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35720939]

107. Hachiya, H.; Yoshida, H.; Shimada, U.; Ueda, N. Multi-class AUC maximization for imbalanced ordinal multi-stage tropical cyclone intensity change forecast. Mach. Learn. Appl.; 2024; 17, 100569. [DOI: https://dx.doi.org/10.1016/j.mlwa.2024.100569]

108. Liu, Y.; Loh, H.T.; Sun, A. Imbalanced text classification: A term weighting approach. Expert Syst. Appl.; 2009; 36, pp. 690-701. [DOI: https://dx.doi.org/10.1016/j.eswa.2007.10.042]

109. Nagy, G.; Zhang, X. Simple statistics for complex feature spaces. Data Complexity in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2006; pp. 173-195.

110. Le, P.Q.; Iliyasu, A.M.; Garcia, J.; Dong, F.; Hirota, K. Representing visual complexity of images using a 3d feature space based on structure, noise, and diversity. J. Adv. Comput. Intell. Intell. Inform.; 2012; 16, pp. 631-640. [DOI: https://dx.doi.org/10.20965/jaciii.2012.p0631]

111. Mars, M. From word embeddings to pre-trained language models: A state-of-the-art walkthrough. Appl. Sci.; 2022; 12, 8805. [DOI: https://dx.doi.org/10.3390/app12178805]

112. Sinjanka, Y.; Musa, U.I.; Malate, F.M. Text Analytics and Natural Language Processing for Business Insights: A Comprehensive Review. Int. J. Res. Appl. Sci. Eng. Technol.; 2023; 11, [DOI: https://dx.doi.org/10.22214/ijraset.2023.55893]

113. Bashiri, H.; Naderi, H. Comprehensive review and comparative analysis of transformer models in sentiment analysis. Knowl. Inf. Syst.; 2024; 66, pp. 7305-7361. [DOI: https://dx.doi.org/10.1007/s10115-024-02214-3]

114. Yadav, A.; Patel, A.; Shah, M. A comprehensive review on resolving ambiguities in natural language processing. AI Open; 2021; 2, pp. 85-92. [DOI: https://dx.doi.org/10.1016/j.aiopen.2021.05.001]

115. Seneviratne, I.S. Text Simplification Using Natural Language Processing and Machine Learning for Better Language Understandability. Ph.D. Thesis; The Australian National University: Canberra, Australia, 2024.

116. Garg, R.; Kiwelekar, A.W.; Netak, L.D.; Bhate, S.S. Potential use-cases of natural language processing for a logistics organization. Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough: Latest Trends in AI; Springer: Berlin/Heidelberg, Germany, 2021; Volume 2, pp. 157-191.

117. Kim, Y. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Doha, Qatar, 25–29 October 2014; pp. 1746-1751.

118. Johnson, R.; Zhang, T. Effective use of word order for text categorization with convolutional neural networks. arXiv; 2014; arXiv: 1412.1058

119. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; San Diego, CA, USA, 12–17 June 2016; pp. 1480-1489.

120. Schmidt, R.M. Recurrent neural networks (RNNs): A gentle introduction and overview. arXiv; 2019; arXiv: 1912.05911

121. Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv; 2018; arXiv: 1810.04805

122. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog; 2019; 1, 9.

123. Brown, T.B. Language models are few-shot learners. arXiv; 2020; arXiv: 2005.14165

124. Azevedo, B.F.; Rocha, A.M.A.; Pereira, A.I. Hybrid approaches to optimization and machine learning methods: A systematic literature review. Mach. Learn.; 2024; 113, pp. 4055-4097. [DOI: https://dx.doi.org/10.1007/s10994-023-06467-x]

125. Willard, J.; Jia, X.; Xu, S.; Steinbach, M.; Kumar, V. Integrating scientific knowledge with machine learning for engineering and environmental systems. ACM Comput. Surv.; 2022; 55, pp. 1-37. [DOI: https://dx.doi.org/10.1145/3514228]

126. Banu, S.; Ummayhani, S. Text summarisation and translation across multiple languages. J. Sci. Res. Technol.; 2023; 1, pp. 242-247.

127. Orosoo, M.; Goswami, I.; Alphonse, F.R.; Fatma, G.; Rengarajan, M.; Bala, B.K. Enhancing Natural Language Processing in Multilingual Chatbots for Cross-Cultural Communication. Proceedings of the 2024 5th International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV); Tirunelveli, India, 11–12 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 127-133.

128. Liang, L.; Wang, S. Spanish Emotion Recognition Method Based on Cross-Cultural Perspective. Front. Psychol.; 2022; 13, 849083. [DOI: https://dx.doi.org/10.3389/fpsyg.2022.849083] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35712164]

129. Artetxe, M.; Labaka, G.; Agirre, E. Translation artifacts in cross-lingual transfer learning. arXiv; 2020; arXiv: 2004.04721

130. Schuster, S.; Gupta, S.; Shah, R.; Lewis, M. Cross-lingual transfer learning for multilingual task oriented dialog. arXiv; 2018; arXiv: 1810.13327

131. Yu, M.; Huang, Q.; Qin, H.; Scheele, C.; Yang, C. Deep learning for real-time social media text classification for situation awareness–using Hurricanes Sandy, Harvey, and Irma as case studies. Social Sensing and Big Data Computing for Disaster Management; Routledge: London, UK, 2020; pp. 33-50.

132. Demirsoz, O.; Ozcan, R. Classification of news-related tweets. J. Inf. Sci.; 2017; 43, pp. 509-524. [DOI: https://dx.doi.org/10.1177/0165551516653082]

133. Van de Ven, G.M.; Tuytelaars, T.; Tolias, A.S. Three types of incremental learning. Nat. Mach. Intell.; 2022; 4, pp. 1185-1197. [DOI: https://dx.doi.org/10.1038/s42256-022-00568-3] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36567959]

134. Yan, H.; Gui, T.; Dai, J.; Guo, Q.; Zhang, Z.; Qiu, X. A unified generative framework for various NER subtasks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Online, 1–6 August 2021.

135. Mohit, B. Named entity recognition. Natural Language Processing of Semitic Languages; Springer: Berlin/Heidelberg, Germany, 2014; pp. 221-245.

136. Bui, D.D.A.; Del Fiol, G.; Jonnalagadda, S. PDF text classification to leverage information extraction from publication reports. J. Biomed. Inform.; 2016; 61, pp. 141-148. [DOI: https://dx.doi.org/10.1016/j.jbi.2016.03.026] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27044929]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

The automated classification of texts into predefined categories has become increasingly prominent, driven by the exponential growth of digital documents and the demand for efficient organization. This paper surveys text classification and machine learning in depth, consolidating diverse aspects of the field into a single, comprehensive resource, which is rare in the current literature. Few studies have achieved such breadth, and this work aims to provide a unified perspective for researchers and the academic community. The survey examines the evolution of machine learning in text categorization (TC), highlighting its advantages over manual classification, such as enhanced accuracy, reduced labor, and adaptability across domains. It describes the main TC tasks and contrasts machine learning methodologies with knowledge engineering approaches, showing the strengths and flexibility of data-driven techniques. Key applications of TC are explored, alongside an analysis of core machine learning methods, including document representation techniques and dimensionality reduction strategies. The study also evaluates a range of text categorization models, identifies persistent challenges such as class imbalance and overfitting, and investigates emerging trends shaping the future of the field. It covers the essential components of a TC system, namely document representation, classifier construction, and performance evaluation, to give a well-rounded view of the current state of TC. Finally, the paper sets out research directions, emphasizing areas that require further innovation, such as hybrid methodologies, explainable AI (XAI), and scalable approaches for low-resource languages. By bridging gaps in existing knowledge and suggesting actionable paths forward, this work aims to serve as a resource for academics and industry practitioners and to foster further exploration and development in text classification.

Details

Title

Text Classification: How Machine Learning Is Revolutionizing Text Categorization

Author

Allam, Hesham; Makubvure, Lisa; Gyamfi, Benjamin; Graham, Kwadwo Nyarko; Kehinde Akinwolere

First page

130

Publication year

2025

Publication date