1. Introduction
This study introduces a new text feature extraction algorithm and, building on it, proposes a new text classification algorithm. The research background and motivation are therefore discussed in two corresponding parts.
(1) Research on text feature extraction algorithms
Text feature extraction is a core task in natural language processing, and the methods can be broadly classified into traditional statistical methods, word embedding methods, and deep learning methods.
(a) Traditional statistical methods mainly include:
① Bag of Words (BoW) [1]: represents a text by counting the frequency of each word within it. Its main advantage is simplicity of implementation; its main drawback is the loss of contextual and word-order information.
② Term Frequency-Inverse Document Frequency (TF-IDF) [2]: evaluates the importance of a word by relating its frequency in a document to its frequency across the entire corpus. Its main advantage is the ability to suppress the influence of common words; its main drawback is that it disregards semantic and contextual relationships.
(b) Word embedding methods mainly include:
① Word2Vec [3]: uses neural networks to convert words into low-dimensional vectors, with Skip-gram and Continuous Bag of Words (CBOW) [4] as its two main models. Its main advantage is the ability to capture semantic similarity between words and contextual information; its main drawbacks are the need for large amounts of training data and poor performance on rare words.
② Global Vectors for Word Representation (GloVe) [5]: generates word vectors by constructing a word co-occurrence matrix and performing matrix factorization. Its main advantage is the ability to capture global co-occurrence information; its main drawback is the significant memory required to store the co-occurrence matrix.
③ FastText [6]: improves the handling of rare words by considering subword units (n-grams). Its main advantage is better handling of rare words and spelling errors; its main drawback is increased model complexity.
④ Bidirectional Encoder Representations from Transformers (BERT) [7]: a pre-trained language model based on the Transformer architecture. Its main advantage is the ability to capture deep semantic information; its main drawbacks are the significant computational resources and large amounts of data required for pre-training.
(c) Deep learning methods mainly include:
① Long Short-Term Memory (LSTM) [8] and Gated Recurrent Unit (GRU) [9]: recurrent neural networks that are well suited to processing sequential data. Their main advantage is the ability to capture long-term dependencies in sequences; their main drawbacks are slow training and high computational cost.
② Transformers [10]: use attention mechanisms to capture long-range dependencies. Their main advantage is excellent performance on long sequences; their main drawbacks are high model complexity and the significant computational and data resources required.
However, despite the fact that word embedding methods and deep learning methods have become mainstream, traditional statistical methods still have the following key advantages: ① Lower computational resource requirements. Traditional statistical methods are simple and intuitive, requiring less computational power and not involving complex parameter adjustments. ② Better real-time performance. When computational resources are limited or real-time processing is required, traditional statistical methods offer higher computational efficiency and can provide sufficiently fast response times and reasonable accuracy for certain real-time applications. ③ Providing baseline standards and feature combinations. Traditional statistical methods can serve as baseline models to evaluate the performance of more complex models, offering researchers a simple comparative standard. Additionally, traditional statistical methods can be combined with word embedding methods to enhance model performance.
It is precisely because traditional statistical methods still possess undeniable advantages and value that some researchers continue to focus on exploring methods in this field. In recent years, research efforts have primarily centered on text feature selection methods, leading to several significant achievements. The study in [11] proposed the TF-ICF algorithm, which differs from the inverse document frequency (IDF) used in TF-IDF by employing inverse category frequency to measure the importance of terms. This approach is a theoretically concise and effective supervised feature weighting algorithm. The researcher Bekir Parlak and colleagues have also conducted a series of studies on text feature selection methods. For instance, in [12], they explored three globalizing techniques—SUM, AVG, and MAX—to transform local feature weights into global feature weights. In [13], they proposed the FCWS algorithm, which integrates the joint probability distribution of features and categories, effectively reducing the bias associated with traditional methods when handling imbalanced datasets. Additionally, in [14], they introduced the EFS algorithm, which calculates term specificity by combining category probability and corpus probability.
However, the aforementioned studies, particularly frequency-based statistical algorithms represented by TF-IDF and TF-ICF, have not fully leveraged labeled text datasets (data samples), nor have they comprehensively explored text classification from the perspective of category-specific feature statistics. In contrast, the FCWS algorithm [13] requires the joint probability distribution of term features and categories to be considered during computation, incorporating both intra-class and inter-class feature information, and the EFS algorithm [14] likewise takes both intra-class and inter-class feature distributions into account. These methods involve relatively complex statistical computations, which increases the difficulty of understanding and implementing the algorithms.
In view of the aforementioned issues, the first research motivation of this paper is as follows: First, this study adheres to the theoretically concise and intuitive approach of frequency-based statistical algorithms. Building upon the foundations of TF-IDF and TF-ICF, the proposed method retains the advantages of frequency-based algorithms, such as their intuitiveness, simplicity, ease of understanding, and straightforward implementation, while introducing new enhancements. Second, this paper further explores two key questions related to traditional TF-IDF and TF-ICF algorithms: ① Existing research on term frequency statistics is typically confined to individual texts. Can the scope of such statistics be expanded to labeled text datasets? In other words, can terms that frequently appear in a specific category of text better represent that category? ② Influenced by mainstream IDF-based approaches, existing studies predominantly adopt a “reverse thinking” paradigm when calculating document frequency or category frequency. Could a “positive thinking” approach be considered instead? Specifically, if a term appears more frequently within a certain category of text, does it imply a stronger representativeness of that term for the given category?
In order to answer the above two questions, the first research goal of this paper is to remain focused on frequency-based statistical algorithms and to propose a text category feature extraction algorithm that makes full use of existing category-labeled data samples, extends term frequency statistics from individual texts to a category-level text set, and shifts text frequency statistics from “reverse thinking” to “positive thinking”: Average Term Frequency-Document Frequency (ATF-DF).
Further, corresponding to the first research objective, the supplementary discussion of the first research motivation is as follows: Currently, in the field of text feature extraction and text classification, numerous labeled text datasets are already available. The data conditions for conducting scientific research have changed significantly compared to when TF-IDF was first introduced, and supervised text feature weighting algorithms have since been developed on this basis. However, under the statistical thinking of big data, there is still room for further research on the statistical objects and statistical perspectives of existing algorithms. Therefore, this paper proposes the ATF-DF algorithm, aiming to achieve better results in extracting category-specific text features. Furthermore, this algorithm can serve as a foundation for developing new text classification algorithms.
(2) Research on text classification algorithms
Since this paper also involves a new text classification algorithm, it is necessary to introduce the current state of research on related text classification algorithms and the motivation behind this research.
Text classification algorithms can currently be divided into two main categories: deep learning and shallow learning.
Deep learning algorithms [15] are typically implemented using convolutional neural networks (CNN), recurrent neural networks (RNN), and their variants such as LSTM and BERT. In recent years, research in this area has been highly active, owing to the advantages of deep learning in automatic feature extraction, handling large-scale data, managing complex nonlinear relationships, and transfer learning and pre-training mechanisms, and it has become the mainstream approach in modern text classification. [16] provides an in-depth exploration of over 150 deep learning models and more than 40 datasets, offering a thorough study of the technical characteristics, performance, and application scenarios of various models. [17] proposes an improved Elastic Deep Autoencoder (EDA-TEC), which can simultaneously learn the manifold representation of data and clustering labels, thereby enhancing the effectiveness of clustering. [18] introduced the LyEmoBERT model, which is designed for classifying emotions in song lyrics. The study in [19] introduced the BART classifier based on the pre-trained BERT model. This classifier, which leverages a denoising autoencoder built upon the Transformer architecture, has demonstrated strong performance in classification tasks. The study in [20] indicated that the pre-trained GPT model, through fine-tuning, can also achieve outstanding performance across multiple text classification tasks. The main advantages of deep learning algorithms are that they typically outperform shallow learning algorithms in classification performance when dealing with large-scale datasets and complex text classification tasks. The main drawbacks are high data and computational resource requirements, poor interpretability, and complex hyperparameter tuning.
However, despite the significant achievements of modern text classification methods represented by deep learning algorithms, several noteworthy limitations remain. ① Deep learning algorithms typically rely on large volumes of labeled data to train complex model architectures, which limits their applicability in scenarios where data is scarce or annotation is costly. It is important to note, however, that with technological advances, fine-tuning pre-trained models such as BERT has enabled certain deep learning models to achieve strong performance even on relatively small datasets, alleviating this limitation to a certain extent. ② Deep learning models impose substantial demands on computational resources, relying heavily on high-performance GPUs or TPUs. This presents a significant challenge for resource-constrained users, making such models difficult to afford or deploy, particularly on widely used mobile smart devices with limited computational capacity. ③ The “black-box” nature of deep learning models makes their decision-making processes difficult to interpret, which is a significant drawback in applications that require clear and transparent classification criteria, such as legal or medical text classification. ④ The selection and tuning of hyperparameters in some deep learning models require extensive experimentation and domain expertise, further increasing the difficulty and cost of deployment and use.
In comparison, shallow learning algorithms often exhibit inferior performance and scalability when handling large-scale datasets or high-dimensional data compared to deep learning algorithms. However, shallow learning algorithms still have advantages in the following scenarios: ① Limited computational resources. ② Small-scale dataset scenarios. ③ High interpretability requirements. ④ Understanding data structure and patterns and facilitating feature engineering. ⑤ Providing new baseline models. ⑥ Scenarios requiring rapid development of usable classifiers.
Because of the continuing research value of shallow learning, many scholars are still committed to studying shallow learning algorithms. From a technical principle perspective, shallow learning algorithms can be categorized into methods based on mathematical statistics, decision trees, geometric approaches, and optimization, among others. Notable research outcomes include:
In the area of mathematical statistics-based methods, [21] proposes an effective K-Nearest Neighbor (KNN) classification model that uses KTree to select the optimal K value for each test sample. This method outperforms competing approaches in terms of classification precision and computational cost. [22] introduces a parallel Naive Bayes algorithm (PNBA) based on the Spark platform, which is used for large-scale Chinese text classification. In the area of decision tree-based algorithms, [23] presents a decision tree smoothing algorithm that enhances decision tree performance, stabilizes probability estimates, converts the model into easily interpretable rule sets, and is suitable for large-scale datasets. In the area of geometric methods, [24] found that the SVM classifier achieved an excellent F1 score above 86.26% in text classification tasks. [25] compares SVM algorithms with artificial neural network algorithms and finds that selecting an appropriate feature set is crucial for accurate classification. In the area of optimization methods, [26] proposes an improved Discrete Layered Chicken Algorithm (IDLCA), which enhances feature selection and text classification performance through adaptive operators.
In addition to the methods mentioned above, new approaches to text classification continue to emerge. One noteworthy research trend is that a small number of scholars have begun to introduce wavelet analysis tools into the field of text classification. This approach, which can be referred to as wavelet analysis-based methods, has produced some notable research outcomes. [27] uses wavelet analysis to reduce the dimensionality of text feature space vectors. [28] first constructs a semantic network from the text, converts it into an image using a template, and then applies wavelet analysis to extract image features for text classification; this study represents an active exploration of using wavelet analysis tools for text classification.
However, existing research still faces the following issues: The approach in [27] focuses solely on using wavelet analysis for dimensionality reduction, rather than directly performing text classification calculations. [28] requires converting the text into an image before applying wavelet analysis for classification, resulting in a more complex algorithmic logic. Consequently, the final text classification precision is only 15.404%, which is even lower than that of earlier classification algorithms. Overall, current research on wavelet analysis in the field of text classification is limited and has not yet achieved the desired results.
Therefore, the second research objective of this paper is to develop a new text classification algorithm, Average Term Frequency-Document Frequency-Wavelet Analysis (ATF-DF-WA), based on the ATF-DF text feature extraction algorithm proposed in this study.
The second research motivation of this paper can be described as follows. From an informatics perspective, both text and waveforms can be regarded as forms of encoding that can be mutually transformed, allowing the integration of analytical theories and tools from their respective domains. Given the widespread application and remarkable effectiveness of wavelet analysis in scientific research, this paper develops a keen interest in applying wavelet analysis to text classification. By thoroughly analyzing the limitations of existing studies, this research aims to advance improvements in this field. Specifically, the technical motivation is to extract category-specific features from text using the ATF-DF algorithm, represent these features as vectors, and subsequently transform them into waveforms for further analysis. Wavelet analysis is then employed to analyze these waveforms to accomplish the text classification task.
Therefore, the ATF-DF-WA algorithm is divided into two stages. The first stage is the feature extraction phase, in which the ATF-DF algorithm is used to fully extract features from large-scale, multi-class text samples with existing category labels, resulting in class-typical feature vectors. The second stage is the text classification phase. In this stage, feature vectors for the samples to be classified are generated based on the class-typical feature vectors. These feature vectors undergo wavelet analysis (WA) to produce feature layer waveforms, and text classification is completed by calculating the similarity between these waveforms and the class-typical feature layer waveforms.
The proposed ATF-DF-WA algorithm, compared to shallow learning algorithms, not only inherits the efficient computational characteristics of traditional statistical methods but also enhances the accuracy of feature extraction by incorporating category label information. Moreover, by applying wavelet analysis theory to text classification, the algorithm further improves classification performance through waveform similarity calculations. Compared to deep learning algorithms, the ATF-DF-WA algorithm does not require extensive computational resources or large-scale training data, making it suitable for deployment on devices with limited computational power. Additionally, it offers higher interpretability and real-time responsiveness. Its specific advantages include: ① High computational efficiency, making it suitable for resource-constrained devices, particularly excelling in low-resource environments such as mobile smart terminals. ② Strong interpretability, which makes it well suited for tasks that require clear and transparent classification criteria. ③ No requirement for large-scale training data or high-performance hardware, making it an ideal solution for data-scarce and low-cost application scenarios. ④ Good real-time performance, enabling fast responses, which is beneficial for real-time applications.
Therefore, ATF-DF-WA provides an effective alternative solution to both deep learning and traditional shallow learning algorithms for application scenarios that demand high efficiency, low resource consumption, and strong interpretability. This method demonstrates significant potential for practical applications, particularly in various text classification tasks involving labeled text datasets. For instance, in domains such as news, law, and medicine, text category features often exhibit certain regularities. By constructing category feature vectors and integrating wavelet analysis, ATF-DF-WA can accurately capture these features, achieving a significant improvement in classification performance while maintaining low computational cost.
The main contributions of this paper are threefold:
① It introduces the ATF-DF algorithm, a traditional statistical method for feature extraction. It is designed for large-scale data sample scenarios, focusing on extracting category features from labeled data samples. Compared to other traditional statistical methods, this algorithm provides more accurate extraction of text category features.
② The proposed ATF-DF-WA algorithm is a wavelet analysis-based method for text classification. In the first stage, it fully extracts text category features using the ATF-DF algorithm. In the second stage, it transforms the text classification problem into a waveform analysis problem, utilizing wavelet analysis tools for text feature extraction and analysis. This approach effectively leverages the strengths of wavelet analysis in signal feature processing and analysis. As a result, the evaluation metrics for text classification using this algorithm show significant improvements compared to traditional text classification algorithms.
③ The proposed ATF-DF-WA algorithm, compared to deep learning-based text classification algorithms, not only makes full use of large datasets but also maintains the advantages of interpretability and understandability in the classification process, avoiding the black-box issue. Additionally, the algorithm requires significantly less training data and computational resources than deep learning algorithms. While achieving significant improvements in classification performance, it also retains the benefits of shallow learning algorithms.
The remainder of this paper is organized as follows: Section 2 introduces the theoretical background related to this study; Section 3 presents the theory and experimental results of the ATF-DF algorithm; Section 4 discusses the theory and experimental results of the ATF-DF-WA algorithm; and Section 5 provides a summary of the conclusions, identifies the limitations of the study, and outlines directions for future research.
2 Theoretical background
2.1 TF-IDF feature extraction algorithm
In text classification tasks, the first stage is typically feature extraction. Commonly used text feature extraction algorithms include TF-IDF, N-gram models, Word2Vec, and χ²-statistics. Among these, the TF-IDF algorithm is particularly representative and is the primary focus of comparative research on text feature extraction in this paper.
The TF-IDF feature extraction algorithm combines two factors: term frequency (TF) and inverse document frequency (IDF) to calculate the weight of each word. The calculation equation is as follows:
$$\mathrm{TFIDF}_{k,j} = \mathrm{TF}_{k,j} \times \log\frac{N}{n_k} \qquad (1)$$

Where $\mathrm{TFIDF}_{k,j}$ represents the TF-IDF weight of the $k$-th term $t_k$ in the $j$-th document, $\mathrm{TF}_{k,j}$ denotes the frequency of $t_k$ in the $j$-th document, $N$ represents the total number of documents in the corpus, and $n_k$ is the number of documents in the corpus that contain $t_k$.
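For illustration, the following is a minimal sketch of TF-IDF weighting using scikit-learn's TfidfVectorizer (one of the dependencies listed later in this paper); the toy corpus is illustrative only, and note that scikit-learn applies a smoothed IDF with L2 normalization, so the values differ slightly from equation (1).

```python
# A minimal TF-IDF sketch with scikit-learn; the toy corpus is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the team won the match",          # sports-like text
    "the market fund rose today",      # finance-like text
    "the player joined a new team",    # sports-like text
]

vectorizer = TfidfVectorizer()          # smoothed IDF and l2 normalization by default
tfidf_matrix = vectorizer.fit_transform(corpus)

# Each row is one document vector; each column corresponds to a vocabulary term.
print(sorted(vectorizer.vocabulary_))
print(tfidf_matrix.toarray().round(3))
```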
2.2 Term frequency-inverse category frequency (TF-ICF) feature extraction algorithm
Compared to the TF-IDF algorithm, TF-ICF [11] places greater emphasis on the distribution of terms at the category level rather than across the entire document collection. The core idea of TF-ICF is that terms appearing in fewer categories are more distinctive for text classification and should therefore be assigned higher weights. The equation is shown in (2).
$$\mathrm{TFICF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{ICF}(t) \qquad (2)$$

Where $\mathrm{TF}(t, d)$ represents the frequency of term $t$ in document $d$, while $\mathrm{ICF}(t)$ is the inverse category frequency of term $t$, calculated using the following equation:

$$\mathrm{ICF}(t) = \log\frac{C}{C(t)} \qquad (3)$$

Where $C$ represents the total number of categories in the training corpus, and $C(t)$ denotes the number of categories in which the term $t$ appears.
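As a reading aid, the sketch below computes TF-ICF weights following equations (2) and (3) for a small labeled corpus; the function name and input format are assumptions made for illustration, not the original implementation of [11].

```python
# A minimal TF-ICF sketch following equations (2)-(3); names are illustrative.
import math
from collections import defaultdict

def tf_icf(labeled_docs):
    """labeled_docs: list of (tokens, category) pairs; returns one weight dict per document."""
    C = len({cat for _, cat in labeled_docs})        # total number of categories
    term_categories = defaultdict(set)               # term -> set of categories containing it
    for tokens, cat in labeled_docs:
        for t in set(tokens):
            term_categories[t].add(cat)

    weights = []
    for tokens, _ in labeled_docs:
        tf = defaultdict(int)
        for t in tokens:
            tf[t] += 1                                # TF(t, d): raw frequency of t in d
        weights.append({t: f * math.log(C / len(term_categories[t]))   # equation (3): ICF(t)
                        for t, f in tf.items()})
    return weights
```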
2.3 Wavelet analysis
Wavelet Analysis [29] can decompose a signal into sub-signals of different frequencies, allowing for a better understanding and analysis of the signal’s characteristics and properties. The principle involves using a set of functions called wavelet bases, where the original signal is convolved with the wavelet bases to obtain a set of wavelet coefficients. By adjusting and processing these coefficients, the signal can be decomposed and reconstructed. Wavelet bases are a specific set of function families, with common examples including Morlet wavelets, Haar wavelets, and Daubechies wavelets. By selecting different wavelet bases and adjusting the number of decomposition levels, wavelet analysis can be better tailored to different types of signal processing tasks. Continuous Wavelet Transform (CWT) is used to decompose a signal into wavelet components at different scales and frequencies. By convolving the signal with a wavelet basis function, CWT generates wavelet coefficients under varying scales and translations, thereby characterizing the energy distribution of the signal in both the frequency and time domains. The equation is as shown in (4):
$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) \mathrm{d}t \qquad (4)$$

where $W(a, b)$ represents the wavelet coefficient at scale parameter $a$ and translation parameter $b$, $f(t)$ is the original signal, and $\psi^{*}(\cdot)$ denotes the conjugate function of the wavelet basis.
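To make the decomposition concrete, the following sketch applies the continuous wavelet transform from PyWavelets (the pywt dependency used later in this paper) to a toy two-tone signal; the signal, the scale range, and the choice of the Morlet basis are illustrative assumptions.

```python
# A minimal CWT sketch with PyWavelets; signal and scale choices are illustrative.
import numpy as np
import pywt

t = np.linspace(0, 1, 512)
signal = np.sin(2 * np.pi * 8 * t) + 0.5 * np.sin(2 * np.pi * 32 * t)   # two-tone test signal

scales = np.arange(1, 64)
coeffs, freqs = pywt.cwt(signal, scales, "morl")     # Morlet wavelet basis

# coeffs[a, b] is the wavelet coefficient at the a-th scale and translation b,
# characterizing the signal's energy distribution over scale and position.
print(coeffs.shape)                                  # (63, 512)
```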
2.4 Pearson correlation coefficient
The Pearson correlation coefficient between two variables $x$ and $y$ is denoted as $r$ or $r_{xy}$, and its equation is shown in (5):

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sqrt{D(x)\,D(y)}} \qquad (5)$$

where $\mathrm{cov}(x, y)$ is the covariance between $x$ and $y$, and $D(x)$ and $D(y)$ are the variances of $x$ and $y$, respectively.
The Pearson correlation coefficient ranges from [−1, 1]. Its sign indicates the direction of the relationship between the variables, while the magnitude (closer to −1 or 1) reflects the strength of this relationship. A value of −1 indicates a perfect negative linear relationship, 0 signifies no linear relationship, and 1 denotes a perfect positive linear relationship. In academic research, this coefficient is often used to assess the similarity between two vectors or waveforms.
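For reference, the sketch below evaluates equation (5) for two short discrete waveforms and cross-checks the value against scipy.stats.pearsonr; the toy vectors are illustrative only.

```python
# A minimal Pearson correlation sketch for two discrete waveforms.
import numpy as np
from scipy.stats import pearsonr

x = np.array([0.10, 0.40, 0.35, 0.80, 0.05])
y = np.array([0.12, 0.38, 0.30, 0.75, 0.10])

# Equation (5): covariance divided by the square root of the product of variances.
r_manual = np.cov(x, y, bias=True)[0, 1] / np.sqrt(x.var() * y.var())
r_scipy, _ = pearsonr(x, y)

print(round(r_manual, 4), round(r_scipy, 4))         # the two values agree
```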
3 Average term frequency-document frequency (ATF-DF) algorithm
3.1 Algorithm concept
Assuming that in the training set, there are M classes of text with known category labels, the general process of the ATF-DF algorithm is as follows:
First, all texts with known categories in the training set are tokenized into words. After tokenization and preprocessing, M class-specific lexicons (hereinafter referred to as lexicons) are constructed.
Next, the ATF-DF values for all terms are calculated from the lexicons, resulting in M class-specific feature vectors.
Finally, based on the calculated results, the terms are sorted in descending order according to their ATF-DF values, resulting in M class-specific feature vectors arranged in descending order.
The concept of the ATF-DF algorithm is illustrated in Fig 1, and the framework of the algorithm is depicted in Fig 2.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
3.2 The theory and procedure of the ATF-DF algorithm
Expanding on Fig 2, the process of the ATF-DF algorithm is shown in Fig 3.
[Figure omitted. See PDF.]
As shown in Fig 3, the ATF-DF algorithm consists of three main steps:
STEP 1: Assuming the text is categorized into M classes, after tokenization, cleaning, and other preprocessing steps, the lexicon $D_i$ ($i = 1, 2, \ldots, M$) is obtained.
STEP 2: Calculate the class-specific feature weights for all terms in $D_i$ and obtain the class feature vector $V_i$. STEP 2 consists of two specific sub-steps.
(1) Calculate the class-specific feature weights $\mathrm{ATFDF}_{ik}$.
Assume that $D_i$ contains a total of $K$ terms, where the $k$-th term is $t_{ik}$, $k = 1, 2, \ldots, K$, and its class-specific feature weight is calculated in the following order:
① Calculate $\mathrm{ATF}_{ik}$, where the subscript $i$ represents the corresponding $i$-th class lexicon $D_i$, and $k$ represents the $k$-th term in $D_i$. $\mathrm{ATF}_{ik}$ is used to represent the class-specific feature weight of $t_{ik}$ with respect to $D_i$ at the term level. ATF is the abbreviation for Average Term Frequency.
The traditional TF algorithm is used to extract term frequency features from individual texts and is not suitable for scenarios involving large datasets with multiple texts of known categories. The reason is as follows: if TF is used to calculate the frequency of $t_{ik}$ in $D_i$, it may lead to instability in the frequency. That is, the frequency of $t_{ik}$ in individual texts does not tend toward a stable value; it may be high in one text but low in others, making the final frequency feature difficult to determine.
The core idea of ATF is that if a term has a higher average frequency within a specific category, then that term has a stronger ability to distinguish that category, and is therefore assigned a higher weight.
Therefore, this paper proposes calculating the average term frequency (ATF) of $t_{ik}$ in $D_i$ to represent the class-specific feature weight of $t_{ik}$ with respect to $D_i$ at the term level. The calculation equation is as follows:

$$\mathrm{ATF}_{ik} = \frac{1}{J}\sum_{j=1}^{J} \mathrm{TF}_{kj} \qquad (6)$$

Here, the subscript $i$ in $\mathrm{ATF}_{ik}$ indicates that the class-specific feature weight corresponds to the $i$-th class text set $D_i$, $k$ is the subscript of $t_{ik}$, $j$ represents the $j$-th text in $D_i$, $J$ is the total number of texts in $D_i$ that contain $t_{ik}$, and $\mathrm{TF}_{kj}$ represents the TF value of $t_{ik}$ in the $j$-th text.
② Calculate $\mathrm{DF}_{ik}$, where the subscript $i$ represents the corresponding $i$-th class lexicon, and $k$ is the subscript of $t_{ik}$. $\mathrm{DF}_{ik}$ is used to represent the class-specific feature weight of $t_{ik}$ with respect to $D_i$ at the document category level. DF is the abbreviation for Document Frequency.
The traditional IDF algorithm is based on the idea that if a term appears in only a few texts, it has a strong discriminative ability and can be used to identify or classify texts. It is evident that this algorithm is only applicable to scenarios where the text set does not have category labels. However, the scenario in this paper involves calculating the term frequency features of $t_{ik}$ within a text set that already has category labels.
Therefore, this paper proposes calculating the $\mathrm{DF}_{ik}$ value, with the core idea being: since $D_i$ already has category labels, the approach is the exact opposite of the traditional IDF algorithm. The more frequently $t_{ik}$ appears in the texts within $D_i$, the stronger the ability of $t_{ik}$ to represent the category features. This study uses $\mathrm{DF}_{ik}$ to represent the weight of $t_{ik}$ in terms of its ability to characterize the text category at the text frequency level. The calculation equation is as follows:
$$\mathrm{DF}_{ik} = \frac{J}{N_i} \qquad (7)$$

Here, the numerator $J$ represents the total number of texts in $D_i$ that contain $t_{ik}$, while the denominator $N_i$ is the total number of texts in $D_i$.
③ Calculate the ATF-DF value, denoted as $\mathrm{ATFDF}_{ik}$, which is used to comprehensively represent the class-specific feature weight of $t_{ik}$ with respect to $D_i$ at both the term frequency and document category levels. The calculation equation is shown in (8):

$$\mathrm{ATFDF}_{ik} = \mathrm{ATF}_{ik} \times \mathrm{DF}_{ik} \qquad (8)$$

Here, the subscript $i$ indicates that $\mathrm{ATFDF}_{ik}$ corresponds to the $i$-th class lexicon, and $k$ is the subscript of $t_{ik}$.
(2) Calculate the class feature vector $V_i$.
Assuming $D_i$ has $K$ terms, after calculating $\mathrm{ATFDF}_{ik}$ for all terms, the result is $V_i$, which is referred to as the class feature vector corresponding to $D_i$. Its specific composition is as follows:
$V_i$ contains $K$ elements, each of which consists of a two-dimensional array. In the two-dimensional array, the first sub-element is the term, and the second sub-element is the class-specific feature weight corresponding to that term. The $k$-th element of $V_i$ can be represented as $(t_{ik}, \mathrm{ATFDF}_{ik})$.
STEP 3: Calculate and obtain the class feature vector arranged in descending order.
Sort the elements of $V_i$ in descending order based on the value of $\mathrm{ATFDF}_{ik}$ to obtain the class feature vector $W_i$ arranged in descending order.
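To summarize STEPs 1-3, the sketch below computes the ATF-DF weights of equations (6)-(8) for a single class and sorts them into the descending class feature vector; the function name, the normalized TF definition, and the input format (one token list per text of the class) are assumptions made for illustration.

```python
# A minimal ATF-DF sketch for one class lexicon D_i, following equations (6)-(8).
from collections import defaultdict

def class_feature_vector(class_texts):
    """class_texts: list of token lists, all belonging to the i-th category."""
    n_texts = len(class_texts)                 # total number of texts in D_i
    tf_sum = defaultdict(float)                # sum of TF values over texts containing the term
    doc_count = defaultdict(int)               # J: number of texts in D_i containing the term

    for tokens in class_texts:
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for t, c in counts.items():
            tf_sum[t] += c / len(tokens)       # TF value of the term in the j-th text
            doc_count[t] += 1

    weights = {}
    for t in tf_sum:
        atf = tf_sum[t] / doc_count[t]         # equation (6): average term frequency
        df = doc_count[t] / n_texts            # equation (7): document frequency within D_i
        weights[t] = atf * df                  # equation (8): ATF-DF weight

    # STEP 3: sort terms by weight in descending order to obtain the class feature vector W_i.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```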
3.3 Comparison and analysis with related algorithms
In Section 3.2, while explaining the specific calculation steps of ATF-DF, two significant differences between ATF-DF and the TF-IDF algorithm have been highlighted:
(1) The method of term frequency calculation is different: TF calculates the term frequency weight of a specific term within a single text of an unknown category, whereas ATF calculates the average term frequency weight of a specific term within a text set of a known category (which includes multiple texts). The core idea is: if a certain term has a higher average frequency within that category, it has a stronger ability to represent that text category and is therefore assigned a higher weight.
(2) The method of document frequency calculation is different: the basic theory behind IDF is that if a term appears frequently in many texts, it does not have a strong ability to distinguish between text categories. However, since DF is calculated on a text set with known categories, its basic theory is exactly the opposite of IDF: if a certain term appears more frequently in a specific category of texts, it indicates that the term has a better ability to represent that text category.
Additionally, the ATF-DF algorithm has two significant differences from the TF-ICF described in Section 2.2:
The method of term frequency calculation is different: The differences between the TF and ATF algorithms are as previously described.
The method of category frequency calculation is different: ICF calculates the inverse category frequency, with its core idea being: The fewer times a term appears in a category, the more distinctive it is for text classification. In contrast, the calculation of the DF value has an opposing core idea: If a certain term appears more frequently in a specific category of texts, it indicates that the term has a better ability to represent that text category.
A comprehensive comparison of the TF-IDF, TF-ICF, and ATF-DF algorithms is shown in Table 1.
[Figure omitted. See PDF.]
3.4 ATF-DF algorithm experiments
3.4.1 Experimental design and selection of baseline algorithms.
Experimental Objective: To validate the effectiveness of the ATF-DF feature extraction algorithm for text classification.
Experimental Design: This study aims to assess the impact of different feature extraction methods on text classification performance. Specifically, we compare the classic TF-IDF with the ATF-DF algorithm in baseline text classification tasks.
Selection of the Baseline Algorithm: Since it is necessary to compare the performance of TF-IDF and ATF-DF within the same baseline text classification algorithm, the baseline text classification algorithm must meet the following criteria: In the feature extraction phase of this algorithm, while maintaining the basic principles of the original algorithm, it must support replacing the traditional TF-IDF algorithm with the ATF-DF algorithm for experimentation.
Among the candidate baseline algorithms, this paper examines three representative shallow learning algorithms: KNN, Naive Bayes Multinomial (NBM), and SVM, and analyzes whether each of them meets the above criteria:
① KNN: Since both TF-IDF and ATF-DF generate feature vectors that can be used for distance calculation, and no independence assumptions are required, KNN is selected as a baseline algorithm.
② NBM: Since a key step in the NBM algorithm is calculating the probability of a term appearing in a document given its known category, which aligns with the fundamental concept of the ATF-DF algorithm, ATF-DF can be used to replace the original feature extraction algorithm for experimental comparison. Therefore, NBM is selected as a baseline algorithm.
③ SVM: Since the SVM algorithm relies heavily on the independence of feature vectors, and the introduction of text category information in ATF-DF may increase the correlation between feature vectors, potentially conflicting with SVM’s requirement for feature independence, SVM is not selected as a baseline algorithm.
To rigorously compare the performance differences between TF-IDF and ATF-DF within the same baseline algorithm, the feature extraction process in both the KNN and NBM baseline algorithms is consistently implemented using the TF-IDF algorithm. For the sake of clarity and precision, these two baseline algorithms are henceforth referred to as TF-IDF-KNN and TF-IDF-NBM, respectively.
Experimental Evaluation and Analysis: By evaluating the experimental results of the baseline algorithms, the effectiveness of the ATF-DF algorithm is analyzed, leading to the formation of conclusions.
Additionally, the classification performance of the new text classification algorithm, based on ATF-DF, which is introduced later in this paper, further validates the effectiveness of ATF-DF from another perspective.
3.4.2 Description of experimental datasets and environmental parameters.
The experimental dataset used in this study is the THU Chinese Text Classification (THUCHNews) [30] corpus (download link: http://thuctc.thunlp.org/). Developed by the Natural Language Processing Laboratory of Tsinghua University, THUCHNews is a Chinese text classification dataset containing approximately 740,000 text samples. This corpus is renowned in the field of Chinese text research for its comprehensive text categories, extensive scale, and high quality, making it a commonly used resource in academic research. The corpus encompasses a total of 14 news categories, with their respective data categories displayed in Table 2. To facilitate training and testing, 80% of the texts within each category are randomly selected to form the training dataset, while the remaining 20% are designated as the test dataset.
[Figure omitted. See PDF.]
The program was run on a computer with 16.0 GB of RAM and an AMD Ryzen 7 6800H with Radeon Graphics processor clocked at 3.20 GHz. The dependencies include scikit-learn (version 0.24.2) and pywt (version 1.1.1).
3.4.3 Sub-experiment 1: ATF-DF algorithm text category feature extraction experiment and result analysis.
The experimental steps are as follows: After segmenting all texts in the training dataset (the Chinese text segmentation tool employed is jieba [31]) and performing cleaning, the lexicons corresponding to the 14 news categories, denoted as $D_i$ ($i = 1, 2, \ldots, 14$), are obtained.
The key technical details of this step include: (a) The Jieba segmentation tool (version 0.42.1) is used in precise mode to perform Chinese text segmentation, ensuring the accurate extraction of each term. (b) The Chinese stopword list provided by Harbin Institute of Technology (stopword.txt) is used to remove common words with little semantic value, such as “的” (de) and “是” (shi), ensuring that meaningless terms do not interfere with the text analysis. (c) The text is cleaned by removing all non-Chinese characters, including punctuation marks, numbers, and special symbols. Additionally, meaningless whitespace characters are eliminated, and the text content is uniformly converted to lowercase, which is particularly applicable in scenarios containing English characters. (d) Texts with a length of fewer than 5 characters are filtered out, along with those containing a large number of noisy characters such as repeated punctuation marks, to ensure data quality and improve the effectiveness of subsequent text processing and analysis.
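The preprocessing described above can be sketched as follows; the stopword file path is a local assumption, and only the cleaning rules mentioned in points (a)-(c) are shown.

```python
# A minimal preprocessing sketch: jieba precise-mode segmentation, removal of
# non-Chinese characters, and stopword filtering. The stopword file path is assumed.
import re
import jieba

with open("stopwords.txt", encoding="utf-8") as f:       # e.g., the HIT Chinese stopword list
    STOPWORDS = {line.strip() for line in f if line.strip()}

def preprocess(text):
    text = text.lower()                                   # relevant when English characters occur
    text = re.sub(r"[^\u4e00-\u9fa5a-z]", "", text)       # strip punctuation, digits, symbols
    tokens = jieba.lcut(text, cut_all=False)              # precise-mode segmentation
    return [t for t in tokens if t not in STOPWORDS]

# Texts shorter than 5 characters or dominated by noisy characters are filtered out beforehand.
```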
Using the ATF-DF algorithm, the weights $\mathrm{ATFDF}_{ik}$ for all terms in $D_i$ are calculated, and the category feature vectors $V_i$ are derived.
The term weights are then sorted in descending order to obtain the category feature vectors $W_i$.
Due to the high dimensionality of $W_i$, for ease of presentation, only the top 500 elements of each vector are retained, as shown in Table 3.
[Figure omitted. See PDF.]
In Fig 4, six out of the 14 vectors are selected (Sports, Entertainment, Furniture, Lottery, Stock, Finance) to visualize their distribution. Here, the horizontal axis $k$ represents the index of the term element in the category feature vector, and the vertical axis represents the ATF-DF value $\mathrm{ATFDF}_{ik}$ of the $k$-th term in that category feature vector. Additionally, Fig 4 identifies the terms with the maximum and minimum category feature representation weights among the top 500 terms of these six vectors, along with their respective weights.
[Figure omitted. See PDF.]
Results Analysis: Analysis of the data in Table 3 and Fig 4 is as follows:
The terms included in the category feature vectors are strongly related to the text categories. For instance, in the Sports category, terms such as “match,” “team,” and “player” all exhibit a strong relevance to sports.
The distribution of $\mathrm{ATFDF}_{ik}$ values across different category feature vectors varies, and these distribution characteristics align with the linguistic features of the different text categories.
For example, for the Entertainment category, 97% of the values are concentrated between 0 and 0.005; in contrast, the values for the Finance category are unevenly distributed between 0 and 0.07.
To further analyze this, Fig 5 illustrates the distribution of the category feature representation weight values for the terms in the Entertainment and Finance category feature vectors. Here, the horizontal axis represents the range intervals of the category feature representation weight values for all terms in the two vectors, and the vertical axis (percentage %) indicates the proportion of values falling within each interval.
[Figure omitted. See PDF.]
As can be seen from Fig 5, for the Entertainment category, the broad range of topics covered by entertainment news corpora means that the texts encompass a wide variety of terms. The corresponding weight values therefore show relatively small numerical differences, with 97% of them concentrated between 0 and 0.005, indicating that the vast majority of terms, viewed individually, do not carry a prominent category feature representation weight. For the Finance category, the terms are more focused on common financial jargon, so when the weights are calculated, the $\mathrm{ATFDF}_{ik}$ values of financial jargon are higher while those of other terms are lower, resulting in an uneven distribution of the values between 0 and 0.07.
Overall, due to the inherent differences in text categories, there is a variation in the distribution of terms within the corpus, which is precisely the expected effect that the ATF-DF algorithm is designed to achieve.
In summary, based on the analysis of distribution characteristics, it can be concluded that the results and distribution characteristics of the category feature vectors extracted by ATF-DF are in line with the expected design of the ATF-DF algorithm.
3.4.4 Sub-experiment 2: comparative experiment of ATF-DF and TF-IDF on baseline algorithms and result analysis.
Parameter Settings for Baseline Algorithms: The TF-IDF-KNN algorithm is implemented using the TF-IDF feature extraction module and the sklearn.neighbors.KNeighborsClassifier API of the machine learning toolkit scikit-learn. The ATF-DF-KNN algorithm modifies the aforementioned code by replacing the text feature extraction calculation with ATF-DF.
The TF-IDF-NBM algorithm also utilizes the scikit-learn library, implemented by invoking the sklearn.naive_bayes.MultinomialNB. The ATF-DF-NBM algorithm modifies the aforementioned code by replacing the text feature extraction calculation with the ATF-DF method.
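The two baseline pipelines can be sketched with scikit-learn as follows; the hyperparameter values shown are placeholders, while the values actually used in the experiments are those listed in Table 4.

```python
# A minimal sketch of the TF-IDF-KNN and TF-IDF-NBM baselines; hyperparameters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

tfidf_knn = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

tfidf_nbm = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nbm", MultinomialNB(alpha=1.0)),
])

# The ATF-DF variants keep the classifier stage and replace the TfidfVectorizer step
# with a transformer that produces ATF-DF feature vectors.
# tfidf_knn.fit(train_texts, train_labels); y_pred = tfidf_knn.predict(test_texts)
```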
The key parameter settings of the two baseline algorithms are shown in Table 4.
[Figure omitted. See PDF.]
Evaluation Metrics: The metrics precision, recall, and the F1 score are utilized to assess the algorithms, with their respective computational equations provided in equations (9), (10), and (11).
Precision: Refers to the proportion of samples that are truly positive among those classified as positive by the classifier. The computational equation is given by:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)$$

In this context, True Positive (TP) is the number of samples that actually belong to the positive class and are correctly predicted as positive by the model, and False Positive (FP) is the number of samples that do not belong to the positive class but are incorrectly predicted as positive by the model.
Recall: Indicates the proportion of all actual positive samples that have been correctly identified. The computational equation is given by:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (10)$$

Where TP is as defined in equation (9), and False Negative (FN) is the number of samples that actually belong to the positive class but are incorrectly predicted as negative by the model.
F1 Score: A comprehensive measure that takes into account both precision and recall, it is the harmonic mean of precision and recall. The computational equation is given by:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (11)$$
Experimental Results: The comparative experimental calculations of ATF-DF-KNN and TF-IDF-KNN are presented in Table 5, while the comparative experimental calculations of ATF-DF-NBM and TF-IDF-NBM are shown in Table 6.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
In Table 5, the algorithms ATF-DF-KNN and TF-IDF-KNN are denoted by A1 and T1, respectively, and in Table 6, the algorithms ATF-DF-NBM and TF-IDF-NBM are denoted by A2 and T2, respectively. The specific computational equations are as follows:
$$\delta_m = \frac{A_{1m} - T_{1m}}{T_{1m}} \times 100\% \qquad (12)$$

In equation (12), $\delta_m$ denotes the percentage increase or decrease of the ATF-DF-KNN algorithm relative to the TF-IDF-KNN algorithm, where $A_{1m}$ and $T_{1m}$ represent the respective metric values for the ATF-DF-KNN and TF-IDF-KNN algorithms. Here, $m = 1$ for the Precision metric, $m = 2$ for the Recall metric, and $m = 3$ for the F1 metric.
In Table 6, the relevant values in equation (12) are substituted with the experimental outcomes of the ATF-DF-NBM and TF-IDF-NBM algorithms, with the interpretation of the symbols being analogous.
Analysis of Experimental Results: From the experimental results shown in Table 5, it can be observed that the F1-score of the ATF-DF-KNN algorithm in the “Furniture” category has increased by 138.61%, which is a significant improvement. This enhancement is primarily attributed to the ATF-DF algorithm’s accurate capture of category-specific features during the feature vector extraction stage. Specifically, in the “Furniture” category, keywords such as “furniture” and “design” are relatively concentrated, enabling the text classifier to more accurately distinguish this category. In contrast, the improvement in the F1-score for the “Finance” category is only 6.24%, possibly due to the broader distribution of keywords in this category, such as “fund” and “market,” which have similar frequencies, thus increasing the classification difficulty. This indicates that the ATF-DF algorithm still has room for improvement when dealing with categories that exhibit low intra-class text similarity.
According to the experimental results in Table 6, the ATF-DF-NBM algorithm achieves an average improvement of 1.83% in the Precision metric and outperforms the TF-IDF-NBM algorithm in nearly half of the text categories. For example, in the “Real Estate” category, the precision of ATF-DF-NBM improves by 123.25%, demonstrating that the ATF-DF algorithm effectively captures distinctive feature words within this category, such as “property” and “apartment type,” thereby enhancing classification accuracy. In contrast, the improvement in the “Lottery” category’s Precision is relatively modest, at only 0.64%, likely due to the greater diversity of terms in this category, such as “home team” and “team,” which makes it challenging to distinctly differentiate feature vectors. The trends observed in Recall and F1-score performance align with these observations.
3.5 Experimental conclusions of the ATF-DF algorithm
The following conclusions can be drawn from the research presented in Section 3:
(1) The ATF-DF algorithm can accurately extract text category features using a “forward-thinking” frequency statistical approach. Experimental results indicate that ATF-DF outperforms traditional algorithms in terms of classification precision, recall, and F1-score, validating its practical application value.
(2) Compared to TF-IDF, ATF-DF demonstrates stronger category awareness, enabling the precise extraction of feature terms through category-specific information. While TF-IDF focuses on global features and overlooks the distribution of terms within categories, ATF-DF significantly enhances classification accuracy by leveraging category-level term frequency calculations.
(3) ATF-DF more accurately reflects text category characteristics, improving text classification performance. It demonstrates substantial potential and value for practical applications in large-scale text classification tasks.
4 ATF-DF-WA algorithm
4.1 Algorithm concept
From the perspective of energy vibration, each character and word in a text can be regarded as a pulsating energy symbol. A piece of text can thus be seen as a two-dimensional, three-dimensional, or even multidimensional waveform, similar to a piece of music or a visual image. In the research of this paper, since the ATF-DF algorithm retains only frequency characteristics, the calculated class feature vector can be transformed into a two-dimensional waveform. Therefore, it is entirely feasible to design a new algorithm that fully utilizes the mature theories and tools of wavelet analysis for text classification calculations. Based on the aforementioned ideas, this paper proposes a new text classification algorithm.
Specifically, the approach and main steps of this new algorithm are:
(1) The class feature vectors obtained with the ATF-DF algorithm are transformed into typical feature vectors for the text categories (referred to as class-typical feature vectors), which are then used to generate the feature vector of the text to be classified.
(2) These feature vectors are converted into waveforms, and through wavelet analysis, the class-typical feature layer waveform, representing the characteristics of a specific text category, and the feature layer waveform of the text to be classified, representing its category characteristics, are obtained.
(3) The waveform similarity between the aforementioned feature layer waveforms is calculated, and text classification is completed based on the results.
The aforementioned new algorithm, being based on ATF-DF and further utilizing Wavelet Analysis (WA), is thus named the ATF-DF-WA algorithm. The conceptual diagram of its algorithmic approach is illustrated in Fig 6.
[Figure omitted. See PDF.]
4.2 The theory and procedure of the ATF-DF-WA algorithm
Specifically, the ATF-DF-WA algorithm can be divided into two phases: the feature extraction phase and the text classification phase, as illustrated in Fig 7. Fig 8 further elaborates on the detailed process of the algorithm.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Specifically, after the completion of the first phase (the specific steps of which are described in the aforementioned ATF-DF algorithm), the steps of the second phase, as depicted in Fig 8, are as follows:
STEP 1: Perform dimensionality reduction on $W_i$, followed by a random permutation, to obtain the class-typical feature vector $W_i^{*}$.
Upon completion of the first phase, two issues require resolution:
(1) $W_i$ is a high-dimensional vector. In order to reduce computational complexity, there arises the issue of dimensionality reduction, i.e., identifying the optimal dimension that effectively captures the essential information while minimizing complexity.
(2) Different wavelet basis functions, when applied to $W_i$, exhibit variations in the effectiveness of feature extraction, thus presenting the issue of determining the optimal wavelet basis.
In this paper, the resolution of the two aforementioned issues is accomplished through a unified computational process. Assuming the optimal dimensionality for a particular text category is $K_i^{*}$, the process of determining $K_i^{*}$ and the optimal wavelet basis involves the following steps:
On the training dataset, a certain step size is selected to perform dimensionality reduction on $W_i$. Starting from the last term, $W_i$ is reduced in dimension by repeatedly discarding a number of terms equal to the step size, effectively truncating $W_i$ into vectors of different dimensions. Subsequently, for each dimension, a random function is applied to shuffle the terms.
Twelve commonly used wavelet basis functions (‘db2’, ‘db4’, ‘dmey’, ‘haar’, ‘sym2’, ‘sym8’, ‘coif2’, ‘coif4’, ‘bior3.1’, ‘bior5.5’, ‘rbio3.1’, ‘rbio5.5’) are employed to perform text classification calculations. The dimension with the highest Precision is chosen as the optimal dimension. In the case of identical Precision results, the lower dimension value is selected.
The wavelet basis used when Precision is at its highest is also selected as the optimal wavelet basis.
The equation for calculating Precision involved in the aforementioned process is shown in equation (9).
To provide a clear exposition of the details, six additional points are added to elucidate the aforementioned calculation process:
The reason for sorting terms based on ATF-DF weights and then truncating from the last term backward is to retain the terms in $W_i$ with higher ATF-DF values, as the preserved terms have better text category differentiation capabilities.
The vectors are truncated to different dimensions and then randomly shuffled because, if they were kept in descending order, the waveforms of the class-typical feature vectors would all be monotonically decreasing. This would cause the class-typical feature layer waveforms and the feature layer waveforms of the texts to be classified to have similar distributions after subsequent wavelet decomposition, leading to poor performance in the waveform similarity calculation.
Throughout the computational process mentioned above, the text classification algorithm adopted is the ATF-DF-WA algorithm proposed in this paper.
The theoretical explanation for selecting different wavelet bases for different text categories is as follows: Due to significant differences in the feature distribution across different text categories, it is crucial to choose a wavelet base that matches the category characteristics. Wavelet analysis decomposes text features, and different wavelet bases can better adapt to the characteristics of various texts. For example, sports texts may contain more proper nouns and key terms, while entertainment texts may include more emotional and dynamic expressions. Selecting the appropriate wavelet base can more effectively capture these feature differences, thereby improving classification performance.
The theoretical explanation for the dimensionality reduction strategy is as follows: In text feature vector processing, high-dimensional feature spaces increase the computational complexity of classification algorithms and may introduce redundant features and noise, leading to decreased classification performance. Dimensionality reduction removes low-weight and redundant features, preserving the primary category characteristic information. Wavelet analysis theory indicates that the main energy of a signal is typically concentrated in a few features. Therefore, dimensionality reduction not only enhances computational efficiency but also improves the robustness and performance of the classification model.
For a better understanding and a more intuitive explanation, examples of the calculation of the optimal dimension and the best wavelet basis for each text category are provided in the experimental section of this paper.
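The search procedure described above can be sketched as follows; the step size, the tie-breaking implementation, and the evaluate_precision helper (which runs the ATF-DF-WA classifier on the training set for a given candidate vector and wavelet basis) are assumptions made for illustration.

```python
# A minimal sketch of the joint search for the optimal dimension and wavelet basis.
import random

WAVELET_BASES = ["db2", "db4", "dmey", "haar", "sym2", "sym8",
                 "coif2", "coif4", "bior3.1", "bior5.5", "rbio3.1", "rbio5.5"]

def search_dimension_and_basis(sorted_vector, evaluate_precision, step=100):
    """sorted_vector: (term, weight) pairs in descending ATF-DF order for one class.
    evaluate_precision(candidate, basis): Precision of ATF-DF-WA on the training set."""
    best_precision, best_dim, best_basis = -1.0, None, None
    for dim in range(len(sorted_vector), 0, -step):
        candidate = sorted_vector[:dim]            # keep the terms with the highest ATF-DF values
        random.shuffle(candidate)                  # avoid a monotonically decreasing waveform
        for basis in WAVELET_BASES:
            p = evaluate_precision(candidate, basis)
            # prefer higher Precision; on identical Precision, the smaller dimension is kept
            if p > best_precision or (p == best_precision and dim < best_dim):
                best_precision, best_dim, best_basis = p, dim, basis
    return best_dim, best_basis
```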
After obtaining $W_i$ and determining the optimal dimension $K_i^{*}$ for it, the truncated and shuffled vector with this optimal dimensionality becomes the class-typical feature vector, denoted as $W_i^{*}$. Mathematically, the set of class-typical feature vectors can be expressed as:

$$W^{*} = \left\{W_1^{*}, W_2^{*}, \ldots, W_M^{*}\right\} \qquad (13)$$

Where $W_i^{*}$ is a row vector, which can be represented as:

$$W_i^{*} = \left(e_{i1}, e_{i2}, \ldots, e_{iK_i^{*}}\right) \qquad (14)$$

The complete expression of the row vector elements for $W_i^{*}$ is as follows:

$$e_{ik} = \left(t_{ik}, \mathrm{ATFDF}_{ik}\right), \quad k = 1, 2, \ldots, K_i^{*} \qquad (15)$$
STEP 2: Obtain the class-typical feature layer waveform by wavelet decomposition of $W_i^{*}$.
Decompose $W_i^{*}$ using the optimal wavelet basis, and select the first-level wavelet as the class-typical feature layer waveform of the $i$-th category. There are two reasons for choosing the first-level wavelet:
(1) Observation: It was observed during the experiments that the first-level wavelet retains richer details of the waveform features compared to other levels.
(2) Calculation: Choosing the first-level and other-level waveforms as class-typical feature waveforms, calculations were performed on the training dataset. The results indicate that the average classification precision is optimal for the first level. Detailed experimental data can be found in the experimental section of this paper.
STEP 3: Compute the feature vector $F_q$ for the text $q$ to be classified.
Tokenize and preprocess $q$, then calculate the feature vector $F_q$ based on $W_i^{*}$. The specific calculation process is as follows:
Iterate through $k$ from 1 to $K_i^{*}$, comparing each term of $W_i^{*}$ (denoted as $t_{ik}$, the $k$-th term in $W_i^{*}$) with all terms in $q$. If a term in $q$ matches $t_{ik}$ from $W_i^{*}$, then compute the $k$-th element value of $F_q$ according to (16); otherwise, set the $k$-th element value to 0.

$$f_{qk} = \mathrm{TF}(t_{ik}, q) \times \mathrm{DF}_{ik} \qquad (16)$$

Where $f_{qk}$ represents the $k$-th element value of $F_q$, $\mathrm{TF}(t_{ik}, q)$ represents the term frequency of $t_{ik}$ in the text $q$, and $\mathrm{DF}_{ik}$ is the DF value of $t_{ik}$ in $D_i$.
As an illustrative example, Fig 9 demonstrates the specific process of generating the feature vector $F_q$ corresponding to $q$ using a particular $W_i^{*}$. In Fig 9, it is assumed that there are $N$ terms in $q$ and that $N$ is greater than $K_i^{*}$.
[Figure omitted. See PDF.]
The specific calculation process for this example is as follows: starting from the first element of $W_i^{*}$, since the term “competition” cannot be found among the terms of $q$, the corresponding element of $F_q$ is set to 0. The second element of $W_i^{*}$, “record”, exactly matches the first element in $q$ (indicating that it can be found among the elements of $q$), so the second element of $F_q$ is calculated according to (16). The same procedure is applied to the remaining elements. Following the above rules, the process continues from the first element of $W_i^{*}$ until the $K_i^{*}$-th element is processed, resulting in $F_q$ with the same dimension $K_i^{*}$.
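The construction of the feature vector in STEP 3 can be sketched as follows; the function name and input format are assumptions, with df_values holding the precomputed DF value of each retained term for the corresponding class.

```python
# A minimal sketch of STEP 3 / equation (16): build F_q against one class-typical vector.
from collections import Counter

def query_feature_vector(q_tokens, class_terms, df_values):
    """class_terms: the (shuffled) term sequence of the class-typical feature vector W_i*."""
    counts = Counter(q_tokens)
    total = len(q_tokens)
    vec = []
    for term in class_terms:
        if counts[term] > 0:
            tf_q = counts[term] / total            # term frequency of the matched term in q
            vec.append(tf_q * df_values[term])     # equation (16)
        else:
            vec.append(0.0)                        # term of W_i* not found in q
    return vec                                     # same dimension as W_i*
```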
STEP 4: Perform a wavelet decomposition on the feature vector of q to obtain the feature layer waveform of q.
Decompose the feature vector of q using the optimal wavelet basis for the text category, obtaining the first-level wavelet waveform for the sample to be classified, which is the feature-layer waveform of q.
STEP 5: Utilize waveform similarity to accomplish text classification.
Utilize the Pearson correlation coefficient shown in (5) to calculate the similarity between the feature-layer waveform of q and the class-typical waveform of each category, and complete the text classification. The specific process is as follows (a code sketch of this procedure is given after the discussion below):
Iterate i from 1 to M, conducting the following computations:
(1) Based on the feature-layer waveform of q, use the Pearson correlation coefficient in equation (5) to calculate the similarity between this feature-layer waveform and the i-th class-typical feature-layer waveform.
(2) If the calculated result exceeds a predefined threshold, q is classified as belonging to the i-th category, and the calculation ends.
(3) Otherwise, continue generating the feature-layer waveform of q corresponding to the next class-typical feature vector, and calculate the similarity between this feature-layer waveform and the next class-typical feature-layer waveform. Repeat this process until the similarity exceeds the threshold for successful classification.
(4) If none of the calculated results exceed the predefined threshold, take the index i corresponding to the maximum similarity value and classify q as belonging to the i-th class of text.
The reasons for selecting the Pearson correlation coefficient to calculate waveform similarity in this paper are as follows: ① The Pearson correlation coefficient is suitable for discrete waveform data (waveform information sets that are discontinuous in time or space and exist as discrete points). ② Although other methods can also be applied to discrete data, the Pearson correlation coefficient is more intuitive and easier to interpret when assessing similarity. Considering that ATF-DF-WA is a shallow learning algorithm and that interpretability is one of its advantages, and taking into account the characteristics of the data and the research objectives, the Pearson correlation coefficient is the method adopted in this paper.
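Putting STEP 5 together, the following sketch compares q's feature-layer waveforms with the class-typical feature-layer waveforms using the Pearson correlation coefficient; the similarity threshold of 0.9 is an illustrative placeholder, not the value used in the paper, and the two waveforms within each class pair are assumed to have equal length.

import numpy as np

def classify(q_layer_waveforms, class_layer_waveforms, threshold=0.9):
    # q_layer_waveforms[i] is q's feature-layer waveform built against class i;
    # class_layer_waveforms[i] is the i-th class-typical feature-layer waveform.
    similarities = []
    for q_wave, c_wave in zip(q_layer_waveforms, class_layer_waveforms):
        r = np.corrcoef(q_wave, c_wave)[0, 1]     # Pearson correlation coefficient
        similarities.append(r)
        if r > threshold:                          # early exit once the threshold is exceeded
            return len(similarities) - 1, similarities
    # No similarity exceeded the threshold: fall back to the most similar class.
    return int(np.argmax(similarities)), similarities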
4.3 ATF-DF-WA algorithm experiments
4.3.1 Description of experimental dataset and environmental parameters.
To fully validate the text classification performance of ATF-DF-WA, this experimental phase uses three Chinese datasets, namely THUCHNews described in Section 3.4.2, the Sogou Chinese Corpus (Sogou) [32] (download link: http://b.mtw.so/5W8eMF), and Chinese News Text Categorization (CNTC) [33] (download link: https://aistudio.baidu.com/datasetdetail/125160/0).
The Sogou corpus is a widely used Chinese text classification dataset, sourced from Sogou News, and contains 17,910 text samples. The specific categories and the number of texts in each category are shown in Table 7.
[Figure omitted. See PDF.]
The CNTC corpus is a public dataset on the Baidu PaddlePaddle AI Studio platform, containing 668,854 text samples. The specific categories and the number of texts in each category are shown in Table 8.
[Figure omitted. See PDF.]
The reasons for selecting these two corpora are as follows: (1) Both corpora cover multiple news categories, providing comprehensive support for text classification tasks across different fields, making them suitable for multi-scenario text mining research. (2) The Sogou corpus has a balanced distribution of text samples across categories, while the CNTC corpus exhibits significant class imbalance. By combining these two datasets, which complement each other in terms of sample balance and imbalance, the study can offer comprehensive data support for addressing the issues of balanced and imbalanced category samples, forming an ideal foundation for comparative experimental research.
For all the corpora, 80% of the data is selected as the training dataset, and 20% is used as the test dataset.
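For instance, a conventional way to obtain such a split (the stratification and random seed below are illustrative choices, not necessarily those used in the experiments) is:

from sklearn.model_selection import train_test_split

texts = ["sample text %d" % i for i in range(100)]   # placeholder corpus
labels = [i % 5 for i in range(100)]                  # placeholder category labels

# 80% training / 20% test, as described above.
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)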
Other computational environment parameters are the same as those in Section 3.4.2.
4.3.2 Sub-experiment 1: step-by-step calculation results and analysis of the ATF-DF-WA algorithm.
Optimal Dimension and Best Wavelet Basis Calculation Results: This experiment uses the THUCHNews dataset. For the 14 text categories in the training dataset, a step size of 1000 is used to calculate their optimal dimensions and best wavelet bases. The calculation steps are described in STEP 1 of Section 4.2, with the results presented in Table 9.
[Figure omitted. See PDF.]
In this experiment, the results from Table 9 are also applied to the other two datasets.
Based on the calculation results from Table 9, set the dimension of each class feature vector to its optimal dimension, thereby obtaining 14 class-typical feature vectors, and represent them all using two-dimensional waveform diagrams.
As an example, Fig 10 illustrates the waveforms of two class-typical feature vectors. Here, the abscissa corresponds to the sequence number of the vector elements, and the ordinate represents the ATF-DF values of these elements, scaled up by a factor of 10,000 for clarity.
[Figure omitted. See PDF.]
Fig 10 clearly shows that the waveform distribution characteristics of the class-typical feature vectors differ significantly across different text categories.
Selection Calculation Results of Class-Typical Feature Layers: In this paper, each class-typical feature vector undergoes 14 layers of wavelet analysis, resulting in 14 hierarchical waveforms. However, a specific layer of the waveform needs to be selected as the class-typical feature layer for subsequent computations. The selection process involves the following steps:
Perform a 14-layer wavelet decomposition on each class-typical feature vector, whose dimensionality is its optimal dimension, to obtain the waveforms of each layer.
On the training dataset, each of the 14 layers of waveforms was used sequentially as the class typical feature layer waveform, and text classification was performed using the ATF-DF-WA algorithm proposed in this paper. The results are shown in Fig 11, where the horizontal axis L represents the L-th layer waveform obtained after decomposition, and the vertical axis Precision represents the average classification precision.
The waveform layer with the optimal Precision value is selected as the class-typical feature waveform layer. As can be observed from Fig 11, the first layer waveform exhibits the best experimental Precision value, thus it is chosen as the class-typical feature waveform layer for further computations.
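The selection loop described above can be sketched as follows; evaluate_precision is a stand-in for a full ATF-DF-WA classification run on the training dataset with the given layer used as the class-typical feature layer, and is not implemented here.

import pywt

def select_feature_layer(class_vector, wavelet_name, evaluate_precision, max_level=14):
    # The class vector is assumed long enough for a 14-level decomposition.
    coeffs = pywt.wavedec(class_vector, wavelet_name, level=max_level)
    best_level, best_precision = None, -1.0
    for level in range(1, max_level + 1):
        waveform = coeffs[-level]                     # level-1 detail is the last entry
        precision = evaluate_precision(waveform, level)
        if precision > best_precision:
            best_level, best_precision = level, precision
    return best_level, best_precision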
[Figure omitted. See PDF.]
As an example, Fig 12 shows the typical feature layer waveforms of two categories obtained after wavelet decomposition. In the figure, the horizontal axis represents the sequence number of the waveform elements, while the vertical axis represents the amplitude of the waveform elements.
[Figure omitted. See PDF.]
From Fig 12, it can be observed that the typical feature waveform layer of each category not only retains rich detail information but also exhibits distinct waveform distribution characteristics.
The feature vector calculation result for a text q to be classified: Input the text q to be classified from the test dataset, perform word segmentation on the text, and then obtain 14 feature vectors for the test text according to the algorithm described in STEP 3 of Section 4.2.
As an example, Fig 13 shows two of the feature vectors for a test text q, where the horizontal axis represents the index of the vector elements and the vertical axis represents the vector element values amplified by a factor of 10,000.
[Figure omitted. See PDF.]
As seen in Fig 13, the feature vectors of text q also exhibit significant differences in their waveform distribution.
The feature layer waveform calculation result for a text q to be classified: Using the optimal wavelet basis corresponding to each of the 14 feature vectors of text q, perform a 14-level wavelet decomposition according to the algorithm in STEP 4 of Section 4.2, and retain the 14 first-level waveforms as the feature layer waveforms of text q.
As an example, Fig 14 illustrates two feature layer waveforms for text q. In the figure, the horizontal axis represents the index of the waveform elements, and the vertical axis represents the amplitude of the waveform elements.
[Figure omitted. See PDF.]
As shown in Fig 14, each feature layer waveform obtained from the wavelet decomposition of q's feature vectors exhibits distinct characteristics in its waveform distribution, which aligns with the expectations of the algorithm design.
The text classification calculation result for a text q to be classified: The text classification calculation is completed following STEP 5 as described in Section 4.2.
As an example, Table 10 presents the calculated similarities between the feature layer waveforms of a certain text q and the class-typical feature layer waveforms.
[Figure omitted. See PDF.]
In the example, it can be seen that the similarity with the 14th class-typical waveform is greater than the threshold, while the other values are below the threshold. Therefore, q is classified into the 14th category, which is financial text.
4.3.3 Sub-experiment 2: text classification results and analysis of ATF-DF-WA vs. traditional baseline algorithms.
To extensively validate the text classification performance of ATF-DF-WA, Sub-experiment 2 builds upon Sub-experiment 1 by adding the Sogou and CNTC datasets, in addition to the THUCHNews dataset. Calculations and analyses are then conducted on these three datasets, comparing the results with traditional baseline algorithms.
Selection of the Baseline Algorithm: As mentioned earlier, ATF-DF-WA does not utilize a deep learning framework and belongs to the category of shallow learning algorithms. However, to enable a comprehensive performance evaluation, this paper selects four baseline algorithms: the shallow learning algorithms TF-IDF-KNN and LibSVM, as well as the deep learning algorithms TextCNN [34] and TextRNN [35]. In terms of evaluation metrics, precision, recall, and F1 score are all recorded.
Baseline Algorithm Parameter Settings: The implementation and parameter settings of the TF-IDF-KNN algorithm are consistent with those used in the experiments of Section 3.
The implementation of the LibSVM algorithm uses an open-source library for SVM classification [36].
The implementation of the TextCNN and TextRNN algorithms was built using Keras [37].
Table 11 shows the settings for some key parameters.
[Figure omitted. See PDF.]
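The exact architectures and hyper-parameters follow Table 11, which is not reproduced here. Purely as an indication of the kind of Keras TextCNN baseline used, a minimal model of this type might look as follows; the vocabulary size, sequence length, embedding size, filter counts, and kernel sizes are placeholder values rather than the paper's settings.

from tensorflow.keras import layers, models

VOCAB_SIZE, SEQ_LEN, EMBED_DIM, NUM_CLASSES = 50000, 300, 128, 14   # placeholders

def build_textcnn():
    # Minimal TextCNN-style baseline sketch (not the paper's exact configuration).
    inputs = layers.Input(shape=(SEQ_LEN,))
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
    branches = []
    for kernel_size in (3, 4, 5):                      # parallel convolution branches
        b = layers.Conv1D(128, kernel_size, activation="relu")(x)
        b = layers.GlobalMaxPooling1D()(b)
        branches.append(b)
    x = layers.concatenate(branches)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model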
Experimental Calculation Results: Through experiments on the THUCHNews dataset, the evaluation metrics for the classification performance of the five algorithms across the 14 categories, namely Precision, Recall, and F1 score, are presented in Tables 12, 13, and 14, respectively. Across all three datasets, the average Precision, Recall, and F1 score for the five algorithms are shown in Tables 15–17.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Some notes on the above tables:
(a) In the tables above, ① represents ATF-DF-WA, ② TF-IDF-KNN, ③ LibSVM, ④ TextCNN, and ⑤ TextRNN. The numbers highlighted in bold indicate the highest values among the five algorithms for the respective metrics.
(b) The data in the tables are the results of 20 runs. The results for ①, ②, and ③ are consistent across runs, with negligible random variation, while the results for ④ and ⑤ show slight differences across runs, so their average values are reported.
(c) The calculation formula for the improvement percentage in Tables 12–14 is as follows:
(17)
where the left-hand side represents the percentage increase or decrease of the ATF-DF-WA algorithm relative to one of the other four algorithms, and the two quantities on the right are the metric values of the ATF-DF-WA algorithm and of the compared algorithm, respectively. The subscript m denotes the metric: p for Precision, z for Recall, and f for F1 score. The subscript n denotes the compared algorithm: n = 2 corresponds to TF-IDF-KNN, n = 3 to LibSVM, n = 4 to TextCNN, and n = 5 to TextRNN. The “+” and “-” signs on the right indicate that the ATF-DF-WA algorithm's metric is better or worse, respectively, than that of the compared algorithm.
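Equation (17) is not reproduced legibly in this version; a reconstruction consistent with the description above (the symbols are our own notation) is:

\delta_{m,n} = \frac{A_{m,1} - A_{m,n}}{A_{m,n}} \times 100\%, \qquad m \in \{p, z, f\}, \ n \in \{2, 3, 4, 5\},

where A_{m,1} is the metric value of the ATF-DF-WA algorithm and A_{m,n} is that of the n-th compared algorithm.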
(d) The confidence intervals in Tables 12–14 are represented as intervals whose lower bound denotes the minimum possible value of the population parameter at a given confidence level and whose upper bound denotes the maximum possible value of the population parameter at the same confidence level.
(18)(19)(20)(21)(22)
In Equation (18), the left-hand side represents the degrees of freedom and G represents the sample size; in Equation (19), the left-hand side denotes the standard error and s refers to the sample standard deviation; in Equation (20), the left-hand side represents the sample mean; in Equation (21), the left-hand side denotes the minimum possible value of the population parameter at a given confidence level, computed using the t value corresponding to the two-tailed significance level β; in Equation (22), the left-hand side denotes the maximum possible value of the population parameter at the same confidence level, with the other parameters having the same meaning as in Equation (21).
According to the t-distribution table, when G is 14, the degrees of freedom are 13, and β = 0.05, the corresponding t value is 2.16.
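Equations (18)–(22) are not reproduced legibly in this version; a standard reconstruction consistent with the description above (the symbols ν, SE, x̄, CI_low, and CI_high are our own notation) is:

\nu = G - 1 \qquad (18)
SE = \frac{s}{\sqrt{G}} \qquad (19)
\bar{x} = \frac{1}{G}\sum_{i=1}^{G} x_i \qquad (20)
CI_{\mathrm{low}} = \bar{x} - t_{\beta/2,\,\nu}\, SE \qquad (21)
CI_{\mathrm{high}} = \bar{x} + t_{\beta/2,\,\nu}\, SE \qquad (22)

With ν = 13 and β = 0.05 (two-tailed), the critical value t ≈ 2.16, which matches the value quoted above.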
(e) For a visual representation of the tables above, see Figs 15–17.
(f) The experimental results of the five algorithms on the three datasets are shown in Tables 15–17.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Analysis of Calculation Results:
① The experimental data obtained from Tables 12–14 are analyzed as follows:
In terms of Precision, the ATF-DF-WA algorithm performs almost perfectly across 14 text categories, with a Precision of 1.0 in 10 categories. The Precision for the remaining 4 categories is higher than 0.99. This indicates that the ATF-DF-WA algorithm achieves extremely high accuracy in classification tasks, with a very low error rate. Compared to the other four baseline algorithms, ATF-DF-WA shows a significant improvement in Precision, with an average increase of 2.80% to 80.36%. This improvement not only reflects the model’s enhanced ability to accurately identify samples but may also be related to the advantage of using wavelet-based feature selection in the ATF-DF-WA algorithm, which allows it to more effectively capture the feature differences between text categories, thereby optimizing classification performance.
In terms of Recall, the ATF-DF-WA algorithm has an average Recall of 0.965, outperforming the other four algorithms, particularly in entertainment text, where Recall reaches 1.0. This indicates that the ATF-DF-WA algorithm is able to identify more true positive samples, even in cases of class imbalance, effectively reducing the miss rate. Compared to other algorithms, ATF-DF-WA’s Recall improved by 0.10% to 54.65% on average. It is worth noting that, despite the overall excellent performance, the ATF-DF-WA algorithm performs slightly worse in classifying home-related texts compared to Algorithms ② and ③. This may be due to unclear category features or uneven data distribution in the home category. Future optimization work may need to further focus on this category, such as optimizing the selection of the optimal dimension or wavelet basis.
In terms of F1 score, the ATF-DF-WA algorithm achieved high F1 scores across all text categories, with an average of 0.981, outperforming the other four algorithms, particularly in entertainment texts, where the F1 score is 1.0. This suggests that the algorithm not only has high Precision but also effectively mitigates the negative impact of low Recall on overall performance. The F1 score, which combines Precision and Recall, provides a more comprehensive evaluation, indicating that the ATF-DF-WA algorithm is able to balance classification accuracy while capturing as much relevant information as possible in practical applications. Compared to other algorithms, ATF-DF-WA’s F1 score improved by 2.62% to 60.82% on average. The F1 score remained high across all categories, except for the home category, where it was slightly lower than Algorithm ④, further supporting the need for targeted optimization for this category.
② The experimental data obtained from Tables 15–17 are analyzed as follows:
The experimental results further validate the versatility of the ATF-DF-WA algorithm across multiple datasets. Whether in terms of Precision, Recall, or F1 score, ATF-DF-WA’s performance is significantly better than that of the other four algorithms across all three datasets. Specifically, Precision increased by 8.21% to 69.56%, Recall increased by 6.60% to 75.00%, and F1 score increased by 7.93% to 76.82%. These results demonstrate that the ATF-DF-WA algorithm not only performs outstandingly on a single dataset but also consistently delivers superior classification performance across different datasets, showcasing its excellent generalization ability and cross-task adaptability.
4.3.4 Sub-experiment 3: text classification results and analysis of ATF-DF-WA vs. pre-trained model baseline algorithms.
In addition to traditional baseline algorithms, pre-trained models such as BERT have also demonstrated good performance on smaller datasets through fine-tuning. Therefore, this section further conducts a brief comparative study between ATF-DF-WA and such algorithms.
Dataset and Baseline Algorithm Selection: Reference [38] investigated the text classification performance of pre-trained BERT models and proposed a method of fine-tuning the BERT model and embedding it into CNN and RNN deep learning models to improve the accuracy and stability of news text classification. The same THUCNews dataset was also used in this study, facilitating a comparison of results. Therefore, this paper selects the three baseline algorithms reported in reference [38], namely BERT, BERT-CNN, and BERT-RNN, for the comparative study.
Experimental Calculation Results: The ATF-DF-WA algorithm is compared with the three baseline algorithms in terms of average Precision, average Recall, average F1 score, experimental running environment parameters, and training time. The results are shown in Table 18.
[Figure omitted. See PDF.]
Among these, the data for the first four evaluation metrics of the ATF-DF-WA algorithm come from the experiments reported earlier in this paper, while the data for the three baseline algorithms come from reference [38]. The data for the fifth evaluation metric, training time (in hours), correspond to the average of 20 independent training sessions for the ATF-DF-WA algorithm; the corresponding value for the BERT-CNN algorithm comes from the related study in reference [39], while the training times of the BERT-RNN and BERT algorithms are not reported in reference [39].
Analysis of Experimental Results: The data obtained from Table 18 are analyzed as follows:
In the text classification task, the ATF-DF-WA algorithm outperforms the three baseline algorithms in all evaluated metrics. Compared to the best-performing baseline algorithm, BERT-CNN, the ATF-DF-WA algorithm shows a 5.42% improvement in average Precision, a 2.66% improvement in average Recall, and a 4.36% improvement in average F1 score. This indicates that the ATF-DF-WA algorithm not only shows a significant improvement in correctly classifying positive examples (Precision), but also captures more positive examples in terms of Recall, resulting in a more balanced and superior overall performance.
From the perspective of training time and hardware requirements, the ATF-DF-WA algorithm uses only an AMD R7 6800H CPU and 16GB of RAM for computation, whereas the BERT-CNN algorithm utilizes more powerful hardware, including an AMD R5 3600 CPU, an NVIDIA GTX 1060 (6GB) GPU, and 16GB of RAM, with GPU acceleration for computation. Despite using GPU acceleration, the training time for BERT-CNN is still 3.5 hours, whereas ATF-DF-WA completes its training in just 1.98 hours. Without GPU acceleration, ATF-DF-WA demonstrates high efficiency, finishing the training in a significantly shorter time. This reveals its substantial time advantage, especially in resource-limited environments.
In conclusion, the ATF-DF-WA algorithm outperforms pre-trained model baseline algorithms in terms of classification accuracy, recall capability, and computational efficiency. Moreover, it demonstrates superior training efficiency, particularly in environments with limited computational resources.
4.4 Experimental Conclusion of the ATF-DF-WA Algorithm
The ATF-DF-WA algorithm demonstrates significant advantages in text classification tasks. By combining the ATF-DF algorithm with wavelet analysis, this algorithm efficiently extracts text features and achieves high Precision, Recall, and F1 scores across different text categories. The key conclusions are as follows:
(1) Improvement over traditional baseline algorithms: Compared to traditional baseline algorithms, the ATF-DF-WA algorithm shows improvements in all evaluated metrics, especially excelling in handling complex text categories.
(2) Improvement over pre-trained model baseline algorithms: When compared to pre-trained model baseline algorithms, ATF-DF-WA also demonstrates superior performance in terms of text classification metrics and training time, highlighting its strong application potential.
5 Conclusions
5.1 Main conclusions
This paper proposes a novel text category feature extraction algorithm, ATF-DF, and further introduces a new ATF-DF-WA text classification algorithm by integrating wavelet analysis theory. The main conclusions of this study are as follows:
(1) The ATF-DF algorithm has accurate text category feature extraction capabilities: The ATF-DF algorithm can effectively extract feature vectors for text categories. The experimental results indicate that: ① the feature terms in the feature vectors produced by this algorithm are closely related to text categories, and their feature value weights accurately reflect the representativeness and distinguishing capability of these terms within the text categories; ② compared to the TF-IDF algorithm, the ATF-DF algorithm improved Precision by 2.80% and 80.36%, Recall by 0.10% and 54.65%, and F1 score by 2.62% and 60.82% on two text classification baseline algorithms, respectively. This indicates that ATF-DF has an advantage over TF-IDF in text category feature extraction performance.
(2) The ATF-DF-WA algorithm has a performance advantage: Compared to the four traditional shallow learning and deep learning baseline algorithms, the ATF-DF-WA algorithm performs exceptionally well in text classification tasks. The experimental results indicate that the average Precision increased by 2.80% to 80.36%, the average Recall increased by 0.10% to 54.65%, and the average F1 score increased by 2.62% to 60.82%, demonstrating that this algorithm has a significant advantage in text classification performance. Compared to baseline algorithms based on pre-trained models, it also demonstrates advantages in both classification performance and training speed.
(3) The ATF-DF-WA algorithm has application advantages: The ATF-DF-WA algorithm not only fully leverages the statistical advantages of large data samples but also avoids the drawbacks of deep learning algorithms, such as high data and computational resource requirements, poor interpretability, and complex parameter tuning. Therefore, it is a lightweight solution suitable for environments with limited training data and constrained computational resources.
(4) Exploration of introducing wavelet analysis: This study effectively applies wavelet analysis theory and tools to the field of text classification, providing a valuable exploration for innovation in this area.
5.2 Limitations and potential constraints
Despite the achievements of this study, there are some potential limitations:
(1) Limitations in feature representation: The current text-to-waveform conversion process retains only term frequency features, ignoring the positional relationships and interactions between terms. This restricts the model's representational capacity and its ability to capture fine-grained information.
(2) Insufficient multilingual support: The ATF-DF-WA algorithm has been evaluated solely on Chinese texts, and its applicability to other languages has not yet been assessed. Therefore, the algorithm may face performance degradation or adaptability challenges when handling texts in different languages.
5.3 Future outlook
Future research will focus on the following aspects: (1) Retaining more textual information in the typical feature vectors to construct feature vectors with richer information, which can then be mapped into high-dimensional waveforms or even images for further processing. (2) Further expanding on the ideas presented in this study by exploring the integration of additional tools from signal processing and image processing with deep learning frameworks, with the aim of generating more research outcomes through interdisciplinary convergence. (3) Combining the ATF-DF algorithm with deep learning models, such as using the extracted text category feature vectors as input for neural networks (e.g., LSTM or Transformer); this approach could further enhance the accuracy of text classification and fully leverage the powerful representational capabilities of deep learning models. (4) Combining the ATF-DF algorithm with deep embedding methods, for example, by integrating the feature vectors generated by ATF-DF with vectors produced by deep embedding at the feature level, through techniques such as vector concatenation or weighted averaging; alternatively, during model training, feature vectors from different sources can be used as inputs to build a hybrid model. (5) Transforming the feature vectors generated by ATF-DF into a format suitable for input into large models for prompt-tuning, in order to further enhance the classification performance of these large models. (6) Developing more shallow learning and deep learning text classification algorithms based on the ATF-DF algorithm. (7) Optimizing and adapting the ATF-DF algorithm to support text classification tasks in multilingual environments. (8) Introducing automated parameter adjustment mechanisms or developing new wavelet basis functions to accommodate a broader range of task scenarios, thereby further enhancing the generality and applicability of the method.
References
1. Qader WA, Ameen MM, Ahmed BI. An overview of bag of words; importance, implementation, applications, and challenges. 2019 International Engineering Conference (IEC), Erbil, Iraq, 2019: 200–204.
2. Liu C, Sheng Y, Wei Z, et al. Research of text classification based on improved TF-IDF algorithm. 2018 IEEE International Conference of Intelligent Robotic and Control Engineering (IRCE). IEEE, 2018: 218–222.
3. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
4. İrsoy O, Benton A, Stratos K. Corrected CBOW performs as well as Skip-gram. arXiv preprint arXiv:2012.15332, 2020.
5. Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532–43.
6. Dai L, Jiang K. Chinese text classification based on FastText. Computer and Modernization. 2018;2018(5):35.
7. Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
8. Van Houdt G, Mosquera C, Nápoles G. A review on the long short-term memory model. Artificial Intelligence Review. 2020;53(8):5929–55.
9. Dey R, Salem FM. Gate-variants of gated recurrent unit (GRU) neural networks. 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2017: 1597–1600.
10. Subakan C, Ravanelli M, Cornell S, et al. Attention is all you need in speech separation. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 21–25.
11. Attieh J, Tekli J. Supervised term-category feature weighting for improved text classification. Knowledge-Based Systems. 2023;261:110215.
12. Parlak B, Uysal AK. The effects of globalisation techniques on feature selection for text classification. Journal of Information Science. 2021;47(6):727–39.
13. Parlak B. A novel feature and class-based globalization technique for text classification. Multimedia Tools and Applications. 2023;82(24):37635–60.
14. Parlak B, Uysal AK. A novel filter feature selection method for text classification: Extensive Feature Selector. Journal of Information Science. 2023;49(1):59–78.
15. Jia P, Sun W. A survey of text classification based on deep learning. Computer and Modernization. 2021;(07):29–37.
16. Minaee S, Kalchbrenner N, Cambria E. Deep learning-based text classification: a comprehensive review. ACM Computing Surveys (CSUR). 2021;54(3):1–40.
17. Daneshfar F, Soleymanbaigi S, Nafisi A, Yamini P. Elastic deep autoencoder for text embedding clustering by an improved graph regularization. Expert Systems with Applications. 2024;238:121780.
18. Revathy V, Pillai A, Daneshfar F. LyEmoBERT: Classification of lyrics’ emotion and recommendation using a pre-trained model. Procedia Computer Science. 2023;218:1196–208.
19. Lewis M. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
20. Hubert KF, Awa KN, Zabelina DL. The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks. Sci Rep. 2024;14(1):3440. pmid:38341459.
21. Zhang S, Li X, Zong M. Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems. 2017;29(5):1774–85.
22. Liu P, Zhao H, Teng J. Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark. Journal of Central South University. 2019;26(1):1–12.
23. Charbuty B, Abdulazeez A. Classification based on decision tree algorithm for machine learning. Journal of Applied Science and Technology Trends. 2021;2(01):20–8.
24. Tong S, Koller D. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research. 2001;2(11):45–66.
25. Johnson DE, Oles FJ, Zhang T. A decision-tree-based symbolic rule induction system for text categorization. IBM Systems Journal. 2002;41(3):428–37.
26. Daneshfar F, Aghajani MJ. Enhanced text classification through an improved discrete laying chicken algorithm. Expert Systems. n.d.:e13553.
27. Zhu J, Huai L, Cui R, Yin. Feature extraction for text classification based on wavelet analysis. Journal of Chinese Information Processing. 2018;32(11):49–54.
28. Shen H. A Study on Feature Selection Methods Based on Wavelet Analysis and Semantic Information [D]. Shanghai: Shanghai Jiao Tong University, 2005.
29. Lee DTL, Yamamoto A. Wavelet analysis: theory and applications. Hewlett-Packard Journal. 1994;45:44.
30. Maosong S, Jingyang L, Zhipeng G, Yu Z, Yabin Z, Xiance S, Zhiyuan L. THUCTC: An efficient Chinese text classifier. 2016.
31. Ding Y, Teng F, Zhang P, et al. Research on text information mining technology of substation inspection based on improved Jieba. 2021 International Conference on Wireless Communications and Smart Grid (ICWCSG). IEEE, 2021: 561–564.
32. Sogou Labs. Sogou News Corpus: Chinese News Text Content Classification [EB/OL]. 2021. [cited 2024-07-15]. Available from: http://b.mtw.so/5W8eMF
33. Creative Commons. CNTC: Chinese News Text Content Classification [EB/OL]. 2022. [cited 2024-07-15]. Available from: https://aistudio.baidu.com/datasetdetail/125160/0
34. Guo B, Zhang C, Liu J, et al. Improving text classification with weighted word embeddings via a multi-channel TextCNN model. Neurocomputing. 2019;363:366–74.
35. Liu Z. Text classification of electricity policy information based on BERT-optimized TextRNN. 2022 3rd International Conference on Computer Science and Management Technology (ICCSMT). IEEE, 2022: 76–9.
36. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. 2001. [cited August 18, 2024]. Available from: https://www.csie.ntu.edu.tw/~cjlin/libsvm
37. Keras. Keras documentation [EB/OL]. [cited 2024-06-29]. Available from: https://keras.io/
38. Zhang X, Shao J. Research on news text classification based on the improved BERT-CNN model. TV Technology. n.d.;45(07):146–50.
39. Ye R, Shao J, Zhang X. Research on knowledge distillation method for news text classification based on BERT-CNN. Electronic Technology Applications. 2023;49(1):8–13.
Citation: Gao M, Li M, Ling Z, Zhong J, Ding H, Wu Q (2025) Wavelet analysis text classification algorithm based on typical features of data samples. PLoS One 20(6): e0319747. https://doi.org/10.1371/journal.pone.0319747
About the Authors:
Ming Gao
Contributed equally to this work with: Ming Gao, Mengshi Li, Zhi Ling, Jinhao Zhong, Han Ding, Qinghua Wu
Roles: Formal analysis, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing, Conceptualization
Affiliations: School of Electrical Power Engineering, South China University of Technology, Guangzhou, China, School of Electrical Engineering, Guangzhou City University of Technology, Guangzhou, China
Mengshi Li
Contributed equally to this work with: Ming Gao, Mengshi Li, Zhi Ling, Jinhao Zhong, Han Ding, Qinghua Wu
Roles: Data curation, Validation, Visualization, Writing – original draft, Writing – review & editing
E-mail: [email protected]
Affiliation: School of Electrical Power Engineering, South China University of Technology, Guangzhou, China
ORCID: https://orcid.org/0000-0001-7044-1782
Zhi Ling
Contributed equally to this work with: Ming Gao, Mengshi Li, Zhi Ling, Jinhao Zhong, Han Ding, Qinghua Wu
Roles: Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization
Affiliation: Department of Development, Zhuhai Jindao Energy Technology Company, Ltd, ZhuHai, China
Jinhao Zhong
Contributed equally to this work with: Ming Gao, Mengshi Li, Zhi Ling, Jinhao Zhong, Han Ding, Qinghua Wu
Roles: Data curation, Software, Validation, Visualization
Affiliation: School of Electrical Engineering, Guangzhou City University of Technology, Guangzhou, China
Han Ding
Contributed equally to this work with: Ming Gao, Mengshi Li, Zhi Ling, Jinhao Zhong, Han Ding, Qinghua Wu
Roles: Data curation, Software, Validation, Visualization
Affiliation: School of Electrical Engineering, Guangzhou City University of Technology, Guangzhou, China
Qinghua Wu
Contributed equally to this work with: Ming Gao, Mengshi Li, Zhi Ling, Jinhao Zhong, Han Ding, Qinghua Wu
Roles: Conceptualization, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing
Affiliation: School of Electrical Power Engineering, South China University of Technology, Guangzhou, China
© 2025 Gao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Currently, traditional text feature extraction methods fail to fully capture category-specific features when handling text data with existing category labels, thereby limiting classification performance. Meanwhile, text classification methods based on wavelet analysis have yet to achieve optimal performance due to the limitations of their feature extraction and analysis techniques. To address these issues, this paper proposes two novel algorithms: (1) Average Term Frequency-Document Frequency (ATF-DF), which adopts a forward-thinking approach to comprehensively extract category-specific features from labeled text samples, resulting in class feature vectors that effectively represent the text categories; (2) Average Term Frequency-Document Frequency-Wavelet Analysis (ATF-DF-WA), which transforms class feature vectors into waveforms and utilizes wavelet analysis to extract typical class feature layer waveforms and feature layer waveforms of the text to be classified. Text classification is then performed by calculating waveform similarity. Experimental results on the THUCHNews dataset demonstrate that compared to two baseline algorithms, ATF-DF improves Precision, Recall, and F1-score by 13.71%, 28.94%, and 20.74%, respectively. Furthermore, experimental results on the THUCHNews, Sogou, and CNTC datasets indicate that ATF-DF-WA outperforms four baseline algorithms, achieving an average Precision improvement of 2.80% to 80.36%, an average Recall improvement of 0.10% to 54.65%, and an average F1-score improvement of 2.62% to 60.82%. Additionally, experimental results on the THUCHNews dataset reveal that ATF-DF-WA demonstrates advantages in both classification performance and training speed compared to baseline algorithms based on pre-trained models, highlighting its promising potential for practical applications.