Abstract
Mobile news classification systems face significant challenges due to their large scale and complexity. In this paper, we present a comprehensive comparative study of traditional classification models, such as TextCNN and BERT-based models, and large language models (LLMs) for multi-label news categorization in a Chinese mobile news application. We evaluate conventional techniques, including a BERT model, alongside instruction-tuned Qwen models fine-tuned with the LoRA technique to optimize efficiency while preserving classification accuracy. Our experimental results show that BERT models perform best for multi-label classification with balanced datasets, while TextCNN performs better on binary classification tasks. We also find that LSTM and MLP classifiers consistently achieve the highest accuracy with text instruction prompts, and that random embeddings reach competitive accuracy. Furthermore, despite low macro F1 scores caused by class imbalance, the consistent relative performance confirms the validity of our analysis. Our study yields crucial insights into automotive news classification, highlighting the importance of weighing technical capability against deployment constraints when choosing model architectures.
Introduction
With the rapid advancement of the mobile Internet and content distribution platforms, news classification plays an increasingly significant role in content distribution and recommendation systems. Mobile news classification systems face significant challenges due to their large scale and complexity. Recent studies show that mobile applications depend heavily on machine learning techniques for information classification, requiring both high accuracy and efficiency1. Modern news platforms use scalable text classification systems to deliver personalized content to millions of users, while large-scale systems demonstrate real-time processing of massive data streams. However, deploying deep learning models on mobile devices presents substantial challenges, with studies showing higher failure rates as model complexity increases2. Multi-label news classification systems typically manage 100 to 300 categories of varying complexity3, and news recommendation systems must handle diverse content types while keeping response times below 200 ms to retain user engagement4. Research indicates that lightweight convolutional neural networks provide a crucial balance between performance and resource efficiency for mobile deployment5,6. In the automotive industry, accurate classification of news articles is essential for providing personalized content, improving user engagement, and managing content effectively. However, this domain faces several critical challenges that limit the effectiveness of current classification systems.
The first challenge lies in the complexity of multi-label classification7. Traditional single-label techniques cannot handle contemporary news articles, which frequently cover multiple subjects simultaneously8; for example, a single article may discuss market trends, technological advancements, and government policy at once. Research indicates that the diversity of news content requires advanced classification methodologies to discern subtle distinctions between categories9, while unsupervised methods based on models such as BERTopic illustrate the difficulty of achieving accurate classification at both coarse and fine levels10. The second challenge is severe class imbalance, which creates significant performance problems in real-world applications11: some labels are prevalent while others are rare, leading classification systems to favor common categories and underperform on niche topics. The third challenge involves the constraints of mobile deployment, where limited computational resources and strict latency requirements restrict the practical use of advanced models.
Extensive research has attempted to address these challenges. Traditional machine learning methods with hand-crafted features have been widely used12,13, but they offer limited semantic understanding. Deep learning models such as TextCNN and RNNs improved feature extraction14 but still struggle with long-range dependencies and complex semantic relationships. Pre-trained models such as BERT have significantly improved semantic understanding but face deployment challenges on mobile devices due to their computational requirements. Large language models (LLMs) show promise for complex semantic tasks but suffer from heavy computational demands and slow response times15. To address these limitations, we performed a comprehensive comparison of traditional models and LLMs for multi-label automotive news classification. Our study develops a systematic framework to evaluate different model architectures for mobile deployment, considering both performance measures and practical constraints16. We built a dataset of more than 200,000 Chinese automotive news articles with 150 unique labels, providing a robust benchmark for multi-label classification evaluation. The primary contributions are as follows.
We develop a comprehensive evaluation framework that combines theoretical performance metrics with the practical requirements of mobile deployment, such as latency and resource usage.
We provide clear, data-driven model selection guidelines based on specific application needs, demonstrating the trade-offs between accuracy, speed, and deployment complexity for different model architectures.
We present and validate optimization strategies for both traditional models and LLMs that enhance classification performance in mobile environments, particularly to address the challenge of severe class imbalance.
Text convolutional neural network (TextCNN)
TextCNN represents an innovative leap in neural network applications for text classification. Like the methods introduced by Guo17 and Liu18, this model employs convolutional layers with different kernel sizes to extract local text features at multiple scales. Its ability to detect critical n-gram patterns and local semantics through these convolutions underpins its success7,12. For multi-label classification tasks, TextCNN has been extended with multiple parallel output layers or sigmoid activation functions for simultaneous label prediction19. Despite its computational efficiency and generally favorable results8, TextCNN struggles with long-range dependencies and intricate contextual interactions13.
Bidirectional encoder representations from transformers (BERT) models
The BERT models and their various adaptations have dramatically influenced text classification employing a strong pre-training paired with a fine-tuning strategy20. The bidirectional transformer structure improves comprehension of both context and semantic intricacies21. In Chinese text classification, custom versions such as BERT-wwm22 have emerged, adopting whole-word masking techniques to better accommodate the intricacies of the Chinese language. These models have delivered remarkable results in multi-label classification scenarios16,23, particularly in domain-specific applications24 and sentiment analysis25.
Qwen models
The Qwen models represent a major advance in Chinese language processing, building on prior work in multi-label classification. They incorporate innovative pre-training objectives and structural improvements to the transformer architecture. Developed by Alibaba, Qwen is proficient in understanding and generating Chinese text26, with models ranging from compact 0.5B-parameter versions to expansive 7B and 14B configurations. The Qwen series includes both foundational models and instruction-tuned variants, such as Qwen-Chat, which show improved results across a variety of NLP tasks. These models exhibit exceptional abilities in multi-label classification challenges27, especially in handling specialized vocabulary and contextual comprehension15,28.
Fine-tuning and low-rank adaptation (LoRA) for Qwen
Recent advances in adaptation strategies for Large Language Models have resulted in improved solutions for text classification. A notable method is LoRA (Low-Rank Adaptation)29, recognized for its efficient fine-tuning of large models with minimal computational demands. This approach significantly reduces the number of parameters that need adjustment while preserving the performance of the model. In classification tasks, LoRA enables the adaptation of extensive models such as Qwen and its variations with reduced computational cost. This method shows great promise, especially in multilabel classification scenarios30,31, where conventional fine-tuning might be too expensive.
Mobile app news classifier
Classification systems for mobile news face unique challenges that require tailored solutions: they must balance computational efficiency, accuracy, and resource management. Recent research emphasizes lightweight models that maintain classification accuracy within the constraints of mobile environments26. Key optimization methods include model quantization, pruning, and architectural adjustments that reduce model size and accelerate inference. Hybrid strategies that combine edge computing with on-device processing improve performance and resource management11, frequently using batch processing and caching to increase operational efficiency. Class imbalance remains a significant challenge in multi-label news classification.
Methods
In this section, we describe our methodology for multi-label classification of news articles. For a news text comprising tokens $x = (x_1, x_2, \ldots, x_T)$, our goal is to predict the label vector $y = (y_1, y_2, \ldots, y_m) \in \{0,1\}^m$, where each $y_j$ indicates the applicability of the $j$-th label. We learn a function $f: x \mapsto y$ that minimizes classification error while maintaining computational efficiency.
TextCNN model training
We evaluated models of notable importance and popularity in Chinese news classification, specifically TextCNN, BERT, and Qwen, examining their performance on binary and multi-label classification tasks. For multi-label classification, the TextCNN model processes the input text with parallel convolutional layers:
$$\hat{y} = \sigma\left( W_o \left[ \max_i \, \mathrm{ReLU}\!\left( W^h \cdot x_{i:i+h-1} + b^h \right) \right]_{h \in \{3,4,5\}} + b_o \right) \quad (1)$$

Here, $h$ denotes the kernel size, $W^h$ and $b^h$ are the convolution parameters, and $W_o$ and $b_o$ are the output-layer parameters. Following Kim's study32, the model uses several kernel sizes, specifically 3, 4, and 5, to extract n-gram features at different granularities.
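As a concrete illustration, below is a minimal PyTorch sketch of a multi-label TextCNN consistent with Eq. (1); the vocabulary size, embedding dimension, and other defaults are illustrative assumptions rather than the exact experimental configuration.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_filters=128,
                 kernel_sizes=(3, 4, 5), num_labels=150):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per kernel size h, matching W^h and b^h in Eq. (1).
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, h) for h in kernel_sizes
        )
        # Output layer (W_o, b_o); the sigmoid is applied inside the loss
        # (BCE with logits) during training or at inference time.
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_labels)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-over-time pooling (k-max with k = 1) over each feature map.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # raw logits, one per label
```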
BERT model training
For our BERT-based approach, we employ the pre-trained Chinese BERT model together with a tailored classification head:

$$H = \mathrm{BERT}(x), \qquad \hat{y} = \sigma\left( W_2 \, \mathrm{ReLU}(W_1 h_{[\mathrm{CLS}]} + b_1) + b_2 \right) \quad (2)$$

Here, $H$ represents the contextual embeddings produced by BERT, and $h_{[\mathrm{CLS}]}$ is the embedding of the classification token. A multilayer perceptron with ReLU activation maps this representation to label predictions, with $W_1$, $b_1$, $W_2$, and $b_2$ as learnable parameters.
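A corresponding sketch of Eq. (2) using the Hugging Face transformers library is shown below; the checkpoint name bert-base-chinese and the hidden size are assumptions for illustration.

```python
import torch.nn as nn
from transformers import BertModel

class BertMultiLabel(nn.Module):
    def __init__(self, num_labels=150, hidden=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        # Two-layer MLP head: W1/b1 and W2/b2 from Eq. (2).
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.head(h_cls)              # logits; apply sigmoid for probabilities
```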
Qwen model training
Instruction tuning combined with Low-Rank Adaptation (LoRA) is effective for adapting these large language models to classification tasks.
$$W = W_0 + \Delta W = W_0 + BA \quad (3)$$

Here, $W_0$ represents the frozen pre-trained weight matrix, and the update $\Delta W$ is expressed as a low-rank decomposition into matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$. We cast the classification task as text generation using the instruction "Classify this news text into the appropriate category. Return only the category ID.", and the model is trained to produce the appropriate label ID as:
$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x) \quad (4)$$

Here, $y_t$ denotes the token at position $t$ of the generated sequence, and $y_{<t}$ denotes all tokens generated before position $t$.
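The sketch below illustrates this setup with the Hugging Face peft library; the rank mirrors the best setting reported later (r = 8), while the remaining LoRA hyperparameters, the target module names, and the prompt wrapper are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,        # alpha/dropout are assumptions
    target_modules=["q_proj", "v_proj"],          # attention projections; W0 stays frozen
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)           # only the A/B matrices are trainable

def build_prompt(news_text: str) -> str:
    # Instruction format from the paper; the model learns to emit the label ID.
    return ("Classify this news text into the appropriate category. "
            "Return only the category ID.\n" + news_text + "\n")
```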
Loss function and training strategy
For the TextCNN and BERT models, we employ a weighted binary cross-entropy loss:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m} w_j \left[ y_{ij}\log \hat{y}_{ij} + (1 - y_{ij})\log(1 - \hat{y}_{ij}) \right] \quad (5)$$

where the weight associated with label $j$ is defined by the inverse-frequency rule:

$$w_j = \frac{N}{N_j} \quad (6)$$

Here, $N$ denotes the total number of samples and $N_j$ the number of samples carrying label $j$. This weighting balances the contribution of each label during training.
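A minimal PyTorch sketch of Eqs. (5) and (6) follows, assuming the inverse-frequency weighting reconstructed above and a label_counts tensor holding each N_j.

```python
import torch
import torch.nn.functional as F

def weighted_bce(logits, targets, label_counts, n_samples):
    # w_j = N / N_j per label (Eq. 6); clamp avoids division by zero.
    w = n_samples / label_counts.clamp(min=1)
    # Element-wise BCE over every (sample, label) pair (Eq. 5).
    per_label = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    return (w * per_label).mean()  # weight each label's term, then average
```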
For the Qwen models, training minimizes the negative log-likelihood of the target label tokens:

$$\mathcal{L}_{\mathrm{LLM}} = -\sum_{t=1}^{T} m_t \log P(y_t \mid y_{<t}, x) \quad (7)$$

where $m_t = 1$ for label tokens and $m_t = 0$ for prompt tokens. Masking the loss on prompt tokens ensures that the model focuses on accurately predicting the correct label sequence.
Algorithms 1 and 2 present the two training procedures. The standard approach employs weighted binary cross-entropy loss with early stopping on validation F1 scores to prevent overfitting. The LoRA fine-tuning procedure offers a parameter-efficient alternative for large language models such as Qwen: only the low-rank adaptation matrices are trained while the base model remains frozen, and instruction-based prompts with masked loss calculation significantly reduce computational requirements while maintaining performance.
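The following condensed sketch mirrors the standard procedure of Algorithm 1, reusing the weighted_bce helper above; the model, optimizer, data loaders, and the evaluate_micro_f1 helper are assumed to exist in the surrounding pipeline.

```python
import torch

# Early stopping on validation micro-F1 (patience = 3, max 100 epochs, per Table 1).
best_f1, patience, bad_epochs = 0.0, 3, 0
for epoch in range(100):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        logits = model(inputs)                  # e.g., the TextCNN sketch above
        loss = weighted_bce(logits, labels, label_counts, n_samples)
        loss.backward()
        optimizer.step()
    f1 = evaluate_micro_f1(model, val_loader)   # assumed validation helper
    if f1 > best_f1:                            # keep the best checkpoint
        best_f1, bad_epochs = f1, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # stop after 3 epochs w/o improvement
            break
```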
Results
Research questions
We conducted comprehensive experiments to evaluate different model architectures and embedding strategies for multi-label news classification. Our research addresses four key questions:
RQ1: How do TextCNN and BERT-based models perform when handling different numbers of labels in binary and multi-label scenarios?
RQ2: How do various embedding strategies affect the performance of TextCNN and BERT-based models?
RQ3: How effective are instruction-tuned large language models (LLMs), such as Qwen33, for multi-label news classification tasks?
RQ4: How do prompt engineering, resource constraints, and label filtering strategies affect the practical deployment and optimization of Qwen models for automotive news classification?
Experimental setting
Table 1. Experimental configuration for multi-label news classification.
Parameter | Value
|---|---|
GPU | NVIDIA RTX 4090 (24 GB)
CUDA/System RAM | –
Storage | –
Framework | Python 3.9, PyTorch 2.0
Source | Chinese mobile news app
Dataset size | >200,000 articles (8:2 split)
Labels | 150 categories (multi-label)
Avg. length | –
Distribution | Long-tailed
TextCNN kernels | 7, 8, 9
TextCNN filters | 128
BERT base model | Pre-trained Chinese BERT
Qwen (LoRA) | Qwen1.5/2/2.5 variants
Batch size | 32
Optimizer | –
Learning rate | – (cosine schedule)
Max epochs | 100 (Early Stop: 3)
Metrics | Micro-F1, Macro-F1
Inference hardware | Single GPU
Inference batch | 128
Precision | –
Table 2. Label distribution statistics in automotive news dataset.
Category | % | Category | %
|---|---|---|---|
New car analysis | 14.81 | Pre-sale | 1.65
New car pricing | 5.56 | Configuration exposure | 0.79
New car launch | 5.45 | Spy photos | 0.51
Release appearance | 3.04 | Real vehicle exposure | 0.45
Pre-heat | 2.56 | New car official images | 0.44
Declaration images | 0.40 | New car arrival | 0.13
Others: 83.22% | | |
Our experiments used a 24 GB RTX 4090 GPU and the automotive news dataset. Although the dataset is accessible to researchers (contact us via email), company policy permits us to release only testing subsets. The TextCNN configuration adopts kernel sizes of 7, 8, and 9 to capture semantic differences across texts of various lengths, with 128 filters per size to balance feature extraction and computational efficiency. To select the key features of each convolutional layer, we applied k-max pooling with k = 1. We conducted our experiments with Python 3.9 and PyTorch 2.0, randomly splitting the dataset into 80% for training, 10% for validation, and 10% for testing while preserving the original label ratios; a fixed random seed of 42 ensures reproducible results. Detailed configuration settings are presented in Table 1, and the label distribution is shown in Table 2. A minimal sketch of the split appears below.
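This sketch uses plain random splits with seed 42 on an assumed articles list (the load_articles loader is hypothetical); preserving label ratios for multi-label data would additionally require a stratification step such as iterative stratification.

```python
from sklearn.model_selection import train_test_split

articles = load_articles()  # assumed loader returning a list of (text, labels) pairs

train, rest = train_test_split(articles, test_size=0.2, random_state=42)  # 80% train
val, test = train_test_split(rest, test_size=0.5, random_state=42)        # 10% / 10%
```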
Evaluation metrics
To evaluate multi-label classification performance, we employ the conventional multi-label metrics. For a test set with $n$ samples and $m$ labels, let $y_i \in \{0,1\}^m$ be the true label vector and $\hat{y}_i \in \{0,1\}^m$ the predicted label vector for the $i$-th sample, where each element indicates the absence or presence of a label. The following evaluation metrics are used:
Precision measures the proportion of correctly predicted positive labels among all predicted positive labels:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (8)$$
Recall represents the proportion of correctly predicted positive labels among all actual positive labels.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (9)$$
The F1 score provides a balanced measure between precision and recall.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (10)$$
Micro-averaging aggregates contributions from all classes, treating each instance equally to calculate the average metric.
$$F_1^{\mathrm{micro}} = \frac{2 \sum_{j=1}^{m} TP_j}{2 \sum_{j=1}^{m} TP_j + \sum_{j=1}^{m} FP_j + \sum_{j=1}^{m} FN_j} \quad (11)$$
Macro-averaging computes the metric separately for each class and averages the results, giving equal importance to each class, independent of their support:
$$F_1^{\mathrm{macro}} = \frac{1}{m} \sum_{j=1}^{m} \frac{2 \, P_j R_j}{P_j + R_j} \quad (12)$$
where $P_j$ and $R_j$ denote the precision and recall of the $j$-th label, respectively. Micro-averaging emphasizes classes with more instances, making it suitable for imbalanced datasets, whereas macro-averaging treats all classes equally, providing an overall view of performance across labels regardless of their distribution.
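The metrics in Eqs. (8)-(12) can be computed with scikit-learn as sketched below, where y_true and y_pred are binary indicator arrays of shape (n, m); the toy matrices are placeholders for real predictions.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy indicator matrices: 3 samples, 4 labels (replace with real predictions).
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]])

micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
print("micro P/R/F1:", micro[:3])
print("macro P/R/F1:", macro[:3])
```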
Experimental analysis
RQ1: How do TextCNN and BERT-based models perform when handling different numbers of labels in binary and multi-label scenarios?
Fig. 1 [Images not available. See PDF.]
Small sample dataset distribution.
Fig. 2 [Images not available. See PDF.]
Large-scale dataset distribution.
Fig. 3 [Images not available. See PDF.]
Balanced dataset distribution.
Table 3. Performance results on small sample dataset.
Model variant | Prec | Rec | F1
|---|---|---|---|
TextCNN (Single-label) | 0.738 | 0.681 | 0.708 |
TextCNN (Single, Char) | 0.781 | 0.621 | 0.692 |
TextCNN (Binary) | 0.795 | 0.906 | 0.847 |
TextCNN (Binary, Char) | 0.785 | 0.890 | 0.834 |
TextCNN (Binary, Enh.) | 0.845 | 0.860 | 0.852 |
Table 4. Performance results on large-scale dataset.
Category | Prec | Rec | F1
|---|---|---|---|
New car information | 0.895 | 0.962 | 0.927 |
Other | 0.721 | 0.466 | 0.567 |
Accuracy | 0.876 | ||
Macro average | 0.808 | 0.714 | 0.747 |
Weighted average | 0.865 | 0.876 | 0.865 |
Table 5. Performance results on balanced dataset.
Evaluation metric | Value |
|---|---|
Precision | 0.131 |
Recall | 0.006 |
F1-Score | 0.011 |
Correct predictions | 27 |
Total predictions | 206 |
Actual NewCarInfo | 4831 |
Table 6. Performance comparison with Other category included (TextCNN vs. BERT).
Category | TextCNN Prec. | TextCNN Rec. | TextCNN F1 | BERT Prec. | BERT Rec. | BERT F1
|---|---|---|---|---|---|---|
Other | 0.900 | 0.945 | 0.922 | 0.000 | 0.000 | 0.000 |
Low-performance categories (F1 < 0.5) | | | | | |
Spy photos | 0.000 | 0.000 | 0.000 | 0.640 | 0.404 | 0.496 |
New car pricing | 0.648 | 0.451 | 0.532 | 0.625 | 0.250 | 0.357 |
New car launch | 0.679 | 0.361 | 0.472 | 0.733 | 0.057 | 0.105 |
New car analysis | 0.612 | 0.371 | 0.462 | 0.679 | 0.223 | 0.335 |
Failed predictions (F1 = 0.0) | ||||||
Release appearance | 0.567 | 0.173 | 0.265 | 0.000 | 0.000 | 0.000 |
Pre-sale | 0.647 | 0.103 | 0.177 | 0.000 | 0.000 | 0.000 |
Pre-heating | 0.556 | 0.093 | 0.160 | 0.500 | 0.000 | 0.001 |
Configuration exposure | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Table 7. Performance comparison with Other category excluded (TextCNN vs. BERT).
Category | TextCNN Prec. | TextCNN Rec. | TextCNN F1 | BERT Prec. | BERT Rec. | BERT F1
|---|---|---|---|---|---|---|
High-performance categories (F1 > 0.7) | | | | | |
New car analysis | 0.900 | 0.953 | 0.920 | 0.946 | 0.990 | 0.967 |
Spy photos | 0.872 | 0.515 | 0.648 | 0.963 | 0.714 | 0.820 |
New car launch | 0.825 | 0.633 | 0.717 | 0.936 | 0.603 | 0.734 |
Moderate-performance categories (0.5 < F1 < 0.7) | | | | | |
New car pricing | 0.800 | 0.685 | 0.738 | 0.933 | 0.583 | 0.718 |
Pre-heating | 0.805 | 0.482 | 0.603 | 0.874 | 0.675 | 0.761 |
Release appearance | 0.756 | 0.596 | 0.667 | 0.876 | 0.631 | 0.733 |
Low-performance categories (F1 < 0.5) | | | | | |
Configuration exposure | 0.344 | 0.089 | 0.142 | 0.575 | 0.615 | 0.594 |
We evaluated the performance of TextCNN and BERT models under various label complexities by systematically adjusting sample sizes and label distributions across three experimental setups. Figure 1 shows the small-sample distribution, where labels are fairly balanced between the NewCarInfo and Other categories (44% vs. 56%), an ideal setting for evaluating binary classification. Table 3 indicates that under these conditions TextCNN performs better on binary tasks than on multi-label formulations, with its enhanced binary version achieving the highest F1 score of 0.852. Character-encoding variants show mixed results: the single-label character-encoding version achieves higher precision (0.781) but lower recall (0.621). In contrast, analysis of the large-scale dataset (Fig. 2) reveals severe class imbalance, with the Other category comprising 80-90% of samples. Table 4 shows that despite this imbalance, TextCNN delivers strong results on NewCarInfo (F1: 0.927) but struggles with the dominant Other class (F1: 0.567). The weighted-average metrics (F1: 0.865) suggest that the model adapts reasonably well to class imbalance when data are plentiful. However, the balanced-dataset experiment (Fig. 3) reveals a dramatic performance drop caused by forced balancing. Table 5 reports notably poor results (F1: 0.011, precision: 0.131, recall: 0.006), with only 27 correct predictions out of 206 total predictions against 4,831 actual NewCarInfo samples, suggesting that naive balancing techniques can severely harm model performance.
In the multi-label classification analysis, TextCNN and BERT exhibit distinct behaviors. When the Other category is included (Table 6), TextCNN excels at this broad class (F1: 0.922) but struggles with more specific automotive labels. Conversely, BERT fails to predict the Other category (F1: 0.000) because of training exclusions, yet performs better on specific categories such as Spy Photos (F1: 0.496). Removing the Other category (Table 7) greatly improves both models, with BERT outperforming TextCNN in most categories: BERT excels in New Car Analysis (F1: 0.967) and Spy Photos (F1: 0.820), while TextCNN shows more consistent but generally lower performance.
Answer to RQ1
TextCNN excels in binary classification with balanced datasets, achieving an F1 score of 0.852; however, its performance drops sharply under imbalanced data and naive balanced-sampling strategies. BERT demonstrates superior multi-label performance, achieving F1 scores of up to 0.967 for car-related categories. The presence of the Other category significantly affects model performance, underscoring the importance of careful label design in real-world applications.
RQ2: How do various embedding strategies affect the performance of TextCNN and BERT-based models?
Fig. 4 [Images not available. See PDF.]
Data distribution of news labels.
Table 8. Comparative performance analysis of Chinese Pre-trained language models in automotive news classification.
Category | WWM BERT Prec. | WWM BERT Rec. | WWM BERT F1 | Hybrid Prec. | Hybrid Rec. | Hybrid F1
|---|---|---|---|---|---|---|
Vehicle technical specifications | ||||||
New car analysis | 0.995 | 0.980 | 0.988 | – | – | – |
Spy photos | 0.927 | 0.981 | 0.953 | – | – | – |
Declaration images | 0.929 | 0.963 | 0.946 | – | – | – |
New car pricing | 0.935 | 0.903 | 0.919 | – | – | – |
Pre-sale | 0.906 | 0.897 | 0.902 | – | – | – |
New car launch | 0.932 | 0.866 | 0.898 | – | – | – |
Real car exposure | 0.857 | 0.923 | 0.889 | – | – | – |
Configuration exposure | 0.842 | 0.889 | 0.865 | – | – | – |
Official images | 0.917 | 0.815 | 0.863 | – | – | – |
Pre-heating | 0.902 | 0.798 | 0.847 | – | – | – |
Release appearance | 0.871 | 0.802 | 0.835 | – | – | – |
New car arrival | 0.727 | 0.800 | 0.762 | – | – | – |
Consumer guidance | ||||||
Vehicle brand analysis | – | – | – | 0.969 | 0.963 | 0.966 |
Comparison guide | – | – | – | 0.955 | 0.957 | 0.956 |
Single vehicle guide | – | – | – | 0.911 | 0.927 | 0.919 |
Car purchase manual | – | – | – | 0.939 | 0.900 | 0.919 |
Purchase techniques | – | – | – | 0.939 | 0.859 | 0.897 |
Evaluation guide | – | – | – | 0.904 | 0.906 | 0.905 |
Test drive | – | – | – | 0.876 | 0.919 | 0.897 |
Multiple vehicle guide | – | – | – | 0.928 | 0.852 | 0.888 |
Marketing guide | – | – | – | 0.971 | 0.745 | 0.843 |
Vehicle showcase | – | – | – | 0.886 | 0.795 | 0.838 |
Car sharing | – | – | – | 0.945 | 0.729 | 0.823 |
Dealership pricing | – | – | – | 0.829 | 0.782 | 0.805 |
Fig. 5 [Images not available. See PDF.]
Performance heatmap: embedding vs classifier architectures.
Table 9. Embedding vs classifier performance.
Embedding | Best classifier | Accuracy | Worst classifier | Accuracy |
|---|---|---|---|---|
BERT-base-uncased | LSTM/MLP | 0.309 | TextCNN | 0.289 |
Multilingual-BERT | LSTM/MLP/Linear | 0.309 | TextCNN | 0.304 |
Random-embedding | LSTM | 0.309 | TextCNN | 0.305 |
Word2Vec-Style | All | 0.309 | – | 0.309 |
We evaluated two embedding strategies across a range of classification tasks. Figure 4 illustrates the pronounced class imbalance in the dataset, with New Car Analysis accounting for 67% of samples while rare categories such as New Car Arrival constitute only 1%. Table 8 compares the two strategies. The WWM BERT approach uses a Chinese pre-trained BERT model with whole-word masking, as highlighted in22, whereas the BERT+TextCNN approach combines BERT embeddings with a TextCNN classifier. WWM BERT excels in the more frequent categories, achieving an F1 score of 0.988 for New Car Analysis and maintaining excellent performance (F1 > 0.85) in the primary categories. The hybrid BERT+TextCNN model also performs well in content-oriented categories, with F1 scores of 0.966 for Vehicle Brand Analysis and 0.956 for Comparison Guide. Figure 5 visualizes the interactions between embedding methods and classifier architectures. Our ablation study, presented in Table 9, analyzes 8,207 samples with 605 labels. LSTM and MLP classifiers consistently achieve the highest accuracy (0.309), while TextCNN lags behind; LSTM and MLP appear better suited to imbalanced multi-class problems, as they surpass TextCNN across all embedding types. Interestingly, random embeddings reach competitive accuracy (0.309) with LSTM, suggesting that severe class imbalance diminishes the benefits of sophisticated embeddings. Although macro F1 scores are low due to class imbalance, the consistent relative performance lends credibility to our analysis (Table 10).
Answer to RQ2
Different embedding methods lead to markedly different results. WWM BERT delivers superior news classification when using specialized Chinese embeddings, reaching F1 scores of up to 0.988, while the hybrid BERT+TextCNN model also demonstrates robust content classification, with F1 scores peaking at 0.966.
RQ3: How effective are instruction-tuned large language models (LLMs), such as Qwen33, for multi-label news classification tasks?
Fig. 6 [Images not available. See PDF.]
Training loss curve for Qwen1.5-1.8B.
Fig. 7 [Images not available. See PDF.]
Training loss curve for Qwen1.5-4B.
Fig. 8 [Images not available. See PDF.]
Training loss curve for Qwen2-0.5B.
Fig. 9 [Images not available. See PDF.]
Training loss curve for Qwen2-1.5B.
Fig. 10 [Images not available. See PDF.]
Training loss curve for Qwen2.5-0.5B.
Fig. 11 [Images not available. See PDF.]
Training loss curve for Qwen2.5-1.5B.
Fig. 12 [Images not available. See PDF.]
Training loss curve for Qwen2.5-3B.
Table 10. Comparative analysis of foundation models with LoRA fine-tuning (rank = 8, similarity threshold = 0.8). All metrics are micro-averages (mean ± SD) over 5 experimental runs.
Model architecture | Precision | Recall | F1 | Accuracy
|---|---|---|---|---|
Qwen1.5-1.8B-Chat | – | – | – | –
Qwen1.5-4B-Chat | – | – | – | –
Qwen2-0.5B-Instruct | – | – | – | –
Qwen2-1.5B-Instruct | – | – | – | –
Qwen2.5-0.5B-Instruct | – | – | – | –
Qwen2.5-1.5B-Instruct | – | – | – | –
Qwen2.5-3B-Instruct | – | – | – | –
Fig. 13 [Images not available. See PDF.]
Training loss curve for LoRA Rank 2 (Qwen2.5-1.5B-Instruct).
Fig. 14 [Images not available. See PDF.]
Training loss curve for LoRA Rank 4 (Qwen2.5-1.5B-Instruct).
Fig. 15 [Images not available. See PDF.]
Training loss curve for LoRA Rank 8 (Qwen2.5-1.5B-Instruct).
Fig. 16 [Images not available. See PDF.]
Training loss curve for LoRA Rank 16 (Qwen2.5-1.5B-Instruct).
Fig. 17 [Images not available. See PDF.]
Training loss curve for LoRA Rank 32 (Qwen2.5-1.5B-Instruct).
Table 11. Performance comparison of different LoRA ranks on Qwen2.5-1.5B-Instruct (similarity threshold = 0.8). Metrics reported are micro-averages (mean ± SD) from 5 runs.
Metrics | Rank 2 | Rank 4 | Rank 8 | Rank 16 | Rank 32
|---|---|---|---|---|---|
Precision | – | – | – | – | –
Recall | – | – | – | – | –
F1-score | – | – | – | – | –
Accuracy | – | – | – | – | –
Table 12. Per-class performance analysis for Qwen2-1.5B-Instruct (LoRA Rank 8).
Class example | Support (samples) | Precision | Recall | F1-score |
|---|---|---|---|---|
High-frequency classes | ||||
New car models | 2850 | 0.8512 | 0.8245 | 0.8377 |
Electric vehicles | 1975 | 0.7954 | 0.7533 | 0.7738 |
Medium-frequency classes | ||||
Auto repair | 312 | 0.4108 | 0.3557 | 0.3812 |
Car insurance | 155 | 0.3529 | 0.2814 | 0.3132 |
Low-frequency classes | ||||
Vintage car auctions | 25 | 0.0000 | 0.0000 | 0.0000 |
Autonomous driving policy | 12 | 0.0000 | 0.0000 | 0.0000 |
The Qwen model series, spanning 0.5B to 4B parameters and including Qwen1.5, Qwen2, and Qwen2.5 variants, was evaluated with LoRA fine-tuning (rank = 8) on the automotive news dataset. The training loss curves (Figs. 6, 7, 8, 9, 10, 11 and 12) show that the Qwen2.5 series consistently outperformed Qwen1.5 and Qwen2, with the more recent architectures converging faster and more stably, indicating that architectural innovation matters more than model size for this multi-label task. Notably, the Qwen2-1.5B-Instruct model delivered the best overall results (F1: 0.2744), although all models still struggle with fidelity in multi-label classification. A deeper evaluation of the training loss across LoRA ranks (Figs. 13, 14, 15, 16 and 17) underscored the importance of the rank setting: rank 8 achieved a 63.3% reduction in loss and exhibited the lowest metric variance (Table 11), establishing it as the best compromise between capacity and stability, since lower ranks (2 and 4) performed poorly and higher ranks were unstable. Per-class analysis of the best model (Table 12) shows that the moderate micro-F1 score is driven mainly by strong performance on a few high-frequency classes (e.g., New Car Models, F1: 0.8377), while results are unsatisfactory (F1: 0.0000) for most low-frequency labels. This confirms that class imbalance poses a major challenge and is responsible for the notably low macro-F1 scores across all models (Table 15).
Answer to RQ3
Instruction-tuned Qwen models show only moderate success in classifying automotive news. The Qwen2-1.5B-Instruct model reaches the highest performance (F1: 0.2744) with a LoRA rank of 8, indicating that architectural improvements matter more than raw model size. Nevertheless, overall performance remains well below that of conventional discriminative models, implying that LLMs face intrinsic difficulties with large multi-class classification tasks even with the benefit of instruction tuning.
RQ4: How do prompt engineering, resource constraints, and label filtering strategies affect the practical deployment and optimization of Qwen models for automotive news classification?
Table 13. Diagnostic analysis of instruction prompt sensitivity in Qwen2.5-1.5B-Instruct classification.
Prompt type | Prompt template | Acc. | Macro-F1 | Micro-F1
|---|---|---|---|---|
Specific | Please classify this automotive news article into the appropriate category. Return only the category ID | 0.3064 | 0.0008 | 0.3064 |
Direct | Classify the following automotive news article. Return only the category ID | 0.3063 | 0.0008 | 0.3063 |
Conditional | Given this automotive news article, determine its category. Return only the category ID | 0.3025 | 0.0008 | 0.3025 |
Variant | What category does this automotive news article belong to? Return only the category ID | 0.3064 | 0.0008 | 0.3064 |
Analytical | Analyze this automotive news and provide the category ID | 0.2667 | 0.0008 | 0.2667 |
Performance variance (σ²) | | 0.0002 | 0.0000 | 0.0002
Coefficient of variation | | 5.2% | 0.0% | 5.2%
Table 14. Misclassification sensitivity analysis across prompt templates.
Prompt type | Misclassified | Error rate (%) |
|---|---|---|
Specific | 5692 | 5.692 |
Direct | 5689 | 5.689 |
Conditional | 5760 | 5.760 |
Variant | 5692 | 5.692 |
Analytical | 5708 | 5.708 |
Performance range | 71 | 0.071 |
Table 15. Computational resource analysis of classification models.
Model | Size (GB) | VRAM (GB) | Time (ms) | Throughput (samples/s)
|---|---|---|---|---|
BERT-base-uncased | 0.41 | 0.42 | 3.7 | 270.24 |
BERT-multilingual | 0.66 | 0.67 | 3.5 | 284.43 |
TextCNN | 0.15 | 2.50 | 45.2 | 22.10 |
Qwen2.5-1.5B | 5.75 | 5.76 | 13.1 | 76.09 |
Table 16. Performance comparison under different label filtering strategies.
Strategy | Threshold | Excluded | Classes | Samples | Accuracy | Macro-P | Macro-R | Macro-F1 |
|---|---|---|---|---|---|---|---|---|
All Data | – | 0 | 605 | 8207 | 0.317 | 0.0016 | 0.0047 | 0.0024 |
P90 (>1) | 1.0 | 338 | 267 | 267 | 0.000 | 0.000 | 0.000 | 0.000 |
P95 (>1) | 1.0 | 338 | 267 | 267 | 0.000 | 0.000 | 0.000 | 0.000 |
P99 (>1) | 1.0 | 338 | 267 | 267 | 0.000 | 0.000 | 0.000 | 0.000 |
Mean+1.5 STD | 171.7 | 6 | 599 | 4552 | 0.073 | 0.0051 | 0.0115 | 0.0050 |
Mean+2 STD | 224.4 | 3 | 602 | 5150 | 0.095 | 0.0034 | 0.0107 | 0.0041 |
Mean+3 STD | 329.8 | 1 | 604 | 5692 | 0.090 | 0.0047 | 0.0084 | 0.0026 |
Fixed >500 | 500 | 1 | 604 | 5692 | 0.093 | 0.0026 | 0.0103 | 0.0037 |
Fixed >1000 | 1000 | 1 | 604 | 5692 | 0.089 | 0.0041 | 0.0086 | 0.0031 |
Fixed >2000 | 2000 | 1 | 604 | 5692 | 0.095 | 0.0057 | 0.0091 | 0.0038 |
We examine the practical optimization and deployment challenges of the Qwen models, focusing on prompt sensitivity, computational efficiency, and label filtering. Our results reveal clear prompt sensitivity: domain-specific instructions (Specific, Direct, Variant) achieved a peak accuracy of 0.3064, outperforming the Analytical baseline of 0.267 by 14.9% and highlighting the crucial role of explicit context. Carefully crafted prompts also remained stable, exhibiting minimal performance variation (σ² = 0.0002, CV = 5.2%) and consistent error rates across prompt types (5.689% to 5.760%), confirming model reliability (misclassification details: Tables 13, 14). In terms of computational resources (Table 15), the BERT models delivered the best throughput under limited resources, TextCNN showed high latency despite its compact size, and Qwen2.5-1.5B demonstrated reasonable efficiency for intricate tasks. Regarding label filtering (Table 16), using all categories produced reasonable accuracy (0.317) but poor macro-metrics (macro-F1: 0.0024), and the Mean+2 STD strategy (threshold 224.4) proved to be the best compromise.
Answer to RQ4
This analysis yields three key optimization findings: (1) prompt engineering: domain-specific instructions achieve a 14.9% performance improvement with stable variance (CV = 5.2%); (2) resource efficiency: Qwen2.5-1.5B requires 5.76 GB of VRAM and sustains 76.09 samples/s, offering reasonable efficiency for complex tasks; (3) label filtering: the Mean+2 STD strategy (threshold 224.4) best balances accuracy (9.5%) and data retention (62.8%), suggesting hierarchical labeling as the preferred remedy for class imbalance in automotive news classification.
Discussion
Analysis of experimental results
Our study reveals crucial insights into automotive news classification. BERT models strike an optimal balance of accuracy and efficiency, whereas TextCNN prioritizes resource efficiency at the cost of precision. LLMs like Qwen exhibit variable performance, making them ill-suited for this application. Domain-specific embeddings considerably enhance performance across all models, highlighting the advantage of specialized knowledge over larger general models. The choice of model should align with deployment limitations: environments with abundant resources can deploy BERT or LLMs for tasks requiring high accuracy, while constrained settings should favor TextCNN combined with domain embeddings. The recorded inference times (TextCNN: 15 ms, BERT: 45 ms, LLMs: 160 ms+) emphasize the architectural trade-offs relevant to real-time mobile applications. Our results endorse hybrid strategies that choose models based on article complexity and available resources, suggesting a tiered system that uses TextCNN for simpler content and BERT for more complex pieces to balance efficiency and accuracy. Limitations include dataset scope, hardware-specific LLM evaluation, and anticipated performance declines as the automotive lexicon evolves. Future research should investigate model compression and specialized pre-training to close performance gaps while ensuring practical deployment.
Qwen’s classification limitations
Although the Qwen models underwent instruction tuning, they performed poorly in automotive news classification due to architectural constraints. Generative language models34 are misaligned with classification requirements, producing unexpected output beyond the required labels and raising reliability concerns35. Our prompt sensitivity analysis shows minimal performance variation (CV = 5.2%) between instruction formats, and systematic prompting methods36 cannot overcome these fundamental limitations: performance remains significantly inferior to discriminative models such as BERT37 despite domain-specific improvements (a 14.9% increase in accuracy). The severe class imbalance in the dataset (605 categories) is particularly challenging for generative models. This imbalance problem38 is compounded by multi-label learning difficulties, where traditional remedies such as SMOTE prove inadequate, and runtime optimization techniques show limited effectiveness for generative architectures compared to discriminative ones. These results indicate that generative models are less suitable than purpose-built discriminative architectures for precise multi-class classification in specialized domains.
Device deployment optimization
Our results suggest several optimization techniques for mobile deployment: (1) knowledge distillation: train smaller student models with FitNets and attention transfer to achieve roughly 75% size reduction while retaining 90-95% of performance; (2) model compression: apply deep compression with structured pruning39 and INT8 quantization40 to shrink models by 40-75%; (3) hybrid architecture: route simple tasks to TextCNN and complex texts to BERT, following mobile-optimized designs41, cutting average inference time by 60%; (4) dynamic selection: enable automatic model selection through early exiting and runtime pruning based on article complexity and resource availability. Together, these optimizations reduce inference time from 160 ms to 40-50 ms while preserving about 90% of classification accuracy. A sketch of the INT8 step appears below.
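As an illustration of the INT8 step, this sketch applies PyTorch post-training dynamic quantization to the Linear layers of a trained classifier; the stand-in model is an assumption, and real mobile deployment would typically go through an export pipeline such as ONNX or TFLite.

```python
import torch
import torch.nn as nn

# Stand-in for any trained classifier with Linear layers (e.g., the TextCNN sketch).
model_fp32 = nn.Sequential(nn.Linear(384, 256), nn.ReLU(), nn.Linear(256, 150))

# Post-training dynamic quantization: Linear weights are converted to INT8,
# shrinking those layers roughly 4x; activations remain in floating point.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8)

print(model_int8)  # Linear layers now appear as dynamically quantized modules
```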
Cross-domain generalizability
Our approaches exhibit notable cross-lingual and cross-domain flexibility. TextCNN achieves language neutrality through a design free of language-specific architectural constraints and adapts via tailored embeddings. BERT-based methods transfer well through multilingual models: mBERT facilitates cross-lingual transfer without targeted multilingual training, XLM-R enables large-scale unsupervised cross-lingual learning, and straightforward fine-tuning improves task-specific performance. Qwen models can be adapted across languages through LoRA fine-tuning while preserving computational efficiency. Our hierarchical classification strategy effectively handles category ambiguity and class imbalance, with demonstrated success in specialized fields such as medical ICD coding and legal judgment prediction. These strategies extend to scientific literature, biomedical documents, and social media content, laying the groundwork for multilingual news classification and information extraction tasks.
Conclusion
In this paper, we conducted an in-depth analysis of model architectures for multi-label automotive news classification, covering traditional models such as TextCNN, pre-trained language models (PLMs) such as BERT, and large language models (LLMs) such as the Qwen variants. Our examination assessed both classification performance and practical deployment challenges. The BERT models demonstrated superior results, achieving higher micro-F1 scores than TextCNN, whereas Qwen-type LLMs require additional training and output constraints to ensure prediction stability. The optimal architecture depends on deployment needs and application-specific criteria. In particular, domain-specific embeddings significantly boosted performance, especially for simpler models such as TextCNN, highlighting the value of domain adaptation in niche news classification. The central challenge lies in balancing computational efficiency against classification accuracy: although LLMs grasp complex semantics well, their computational intensity hinders mobile deployment, whereas lighter models such as TextCNN are preferable for straightforward classification in resource-constrained environments. This research offers practical guidance for developers building multi-label news classification systems, emphasizing the need to weigh technical capability against deployment limitations when selecting models. Future work should focus on model compression, adaptive model selection, and domain-specific pre-training to advance both performance and efficiency in mobile news classification.
Acknowledgements
This work was supported by the Natural Science Project of Guangdong University of Science and Technology (GKY-2025BSQDK-3). The authors also thank Beijing Bitauto Information Technology Co., Ltd. for providing partial data and GPU computational resources utilized in this research.
Author contributions
D.Y. conceptualized and designed the study, performed experiments, and led the writing of the manuscript. G.L. was responsible for the collection and processing of the data. B.L. executed the core computational analysis and testing. S.L. reviewed the manuscript. All authors reviewed and approved the final version of the manuscript.
Data availability
The experimental code and the automotive industry news dataset are publicly available. The code can be accessed on GitHub at https://github.com/davidyuan666/LabelsClassifier4News, and the dataset is available on Figshare at https://figshare.com/articles/dataset/Textcnn_qwen_experiment_dataset/28794602.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Pérez Arteaga, S; Sandoval Orozco, AL; García Villalba, LJ. Analysis of machine learning techniques for information classification in mobile applications. Appl. Sci.; 2023; 13,
2. Chen, Z. et al. An empirical study on deployment faults of deep learning based mobile applications. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 674–685 (2021).
3. Patil, S; Lokesha, V. Multi-label news category text classification. J. Algebr. Stat.; 2022; 13,
4. Feng, C; Khan, M; Rahman, AU; Ahmad, A. News recommendation systems-accomplishments, challenges & future directions. IEEE Access; 2020; 8, pp. 16702-16725. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.2967792]
5. Chen, Y; Zheng, B; Zhang, Z; Wang, Q; Shen, C; Zhang, Q. Deep learning on mobile and embedded devices: State-of-the-art, challenges, and future directions. ACM Comput. Surv. CSUR; 2020; 53,
6. Hafiz, AM. A survey on light-weight convolutional neural networks: trends, issues and future scope. J. Mob. Multimed.; 2023; 19,
7. Zhang, M-L; Zhou, Z-H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng.; 2006; 18,
8. Tsoumakas, G. & Vlahavas, I. Random k-labelsets: An ensemble method for multilabel classification. In European Conference on Machine Learning, 406–417 (2007).
9. Amsalem, E; Fogel-Dror, Y; Shenhav, SR; Sheafer, T. Fine-grained analysis of diversity levels in the news. Commun. Methods Meas.; 2020; 14,
10. Sayed, M. A., Braşoveanu, A. M., Nixon, L. J., & Scharl, A. Unsupervised topic modeling with BERTopic for coarse and fine-grained news classification. In International Work-Conference on Artificial Neural Networks, 162–174 (2023).
11. Chawla, N. V. & Sylvester, J. Exploiting diversity in ensembles: Improving the performance on unbalanced datasets. In International Workshop on Multiple Classifier Systems, 397–406 (2007).
12. Godbole, S; Sarawagi, S. Discriminative methods for multi-labeled classification. Pacific-Asia Conf. Knowl. Discov. Data Mining; 2004; 1, pp. 22-30. [DOI: https://dx.doi.org/10.1007/978-3-540-24775-3_5]
13. Fürnkranz, J; Hüllermeier, E; Loza Mencía, E; Brinker, K. Multilabel classification via calibrated label ranking. Mach. Learn.; 2008; 73, pp. 133-153. [DOI: https://dx.doi.org/10.1007/s10994-008-5064-8]
14. Read, J., Pfahringer, B., Holmes, G. & Frank, E. Classifier chains for multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 254–269 (2009).
15. Rokach, L; Schclar, A; Itach, E. Ensemble methods for multi-label classification. Expert Syst. Appl.; 2014; 41,
16. Fan, R.-E. & Lin, C.-J. A study on threshold selection for multi-label classification. In Department of Computer Science, National Taiwan University, 1–23 (2007).
17. Guo, B; Zhang, C; Liu, J; Ma, X. Improving text classification with weighted word embeddings via a multi-channel TextCNN model. Neurocomputing; 2019; 363, pp. 366-374. [DOI: https://dx.doi.org/10.1016/j.neucom.2019.07.052]
18. Liu, W., Pang, J. & Li, N. Research on multi-label text classification method based on tALBERT-CNN. Int. J. Comput. Intell. Syst.14(1), 201 (2021).
19. Shou, X; Gao, T; Subramanian, D; Bhattacharjya, D; Bennett, KP. Concurrent multi-label prediction in event streams. Proc. AAAI Conf. Artif. Intell.; 2023; 37,
20. Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., & Dhillon, I. S. Taming pretrained transformers for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3163–3171 (2020).
21. Cheng, W; Hüllermeier, E. Combining instance-based learning and logistic regression for multilabel classification. Mach. Learn.; 2009; 76, pp. 211-225. [DOI: https://dx.doi.org/10.1007/s10994-009-5127-5]
22. Cui, Y; Che, W; Liu, T; Qin, B; Yang, Z. Pre-training with whole word masking for Chinese bert. IEEE/ACM Trans. Audio Speech Lang. Process.; 2021; 29, pp. 3504-3514. [DOI: https://dx.doi.org/10.1109/TASLP.2021.3124365]
23. Ji, S., Tang, L., Yu, S., & Ye, J. Extracting shared subspace for multi-label classification. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 381-389 (2008).
24. Loza Mencía, E. & Fürnkranz, J. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 50–65 (2008).
25. Mewada, A. & Dewang, R. K. SA-ASBA: A hybrid model for aspect-based sentiment analysis using synthetic attention in pre-trained language BERT model with extreme gradient boosting. J. Supercomput.79(5), (2023).
26. Trohidis, K; Tsoumakas, G; Kalliris, G; Vlahavas, IP. Multi-label classification of music into emotions. ISMIR; 2008; 8, pp. 325-330.
27. Tsoumakas, G., Katakis, I. & Vlahavas, I. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), vol. 21, 53–59 (2008).
28. Vens, C; Struyf, J; Schietgat, L; Džeroski, S; Blockeel, H. Decision trees for hierarchical multi-label classification. Mach. Learn.; 2008; 73, pp. 185-214. [DOI: https://dx.doi.org/10.1007/s10994-008-5077-3]
29. Hu, EJ et al. Lora: Low-rank adaptation of large language models. ICLR; 2022; 1,
30. Read, J., Pfahringer, B. & Holmes, G. Multi-label classification using ensembles of pruned sets. In 2008 Eighth IEEE International Conference on Data Mining, 995–1000 (2008).
31. Tahir, M. A., Kittler, J., Mikolajczyk, K., & Yan, F. Improving multilabel classification performance by using ensemble of multi-label classifiers. In Multiple Classifier Systems: 9th International Workshop, MCS 2010, Cairo, Egypt, April 7–9, 2010. Proceedings 9, 11–21 (2010).
32. Gonen, H., Ravfogel, S., Elazar, Y., & Goldberg, Y. It’s not Greek to mbert: inducing word-level translations from multilingual bert. arXiv preprint arXiv:2010.08275 (2020).
33. Zhang, L., Liu, Y., Luo, Y., Gao, F., & Gu, J. Qwen-IG: A Qwen-based instruction generation model for LLM fine-tuning. In Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition, 295-302 (2024).
34. Brown, T et al. Language models are few-shot learners. Adv. Neural. Inf. Process. Syst.; 2020; 33, pp. 1877-1901.
35. Zhou, G., Liu, Y., Yan, Z. & Gelenbe, E. Is ChatGPT trustworthy enough? A review. IEEE Consum. Electron. Mag. (2024).
36. Liu, P et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv.; 2023; 55,
37. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (long and short papers), 4171–4186 (2019).
38. Krawczyk, B. Learning from imbalanced data: open challenges and future directions. Progr. Artif. Intell.; 2016; 5,
39. Liu, Z. et al. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, 2736–2744 (2017).
40. Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704–2713 (2018).
41. Tan, M., & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114 (2019).
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).