Abstract

Emotion classification in natural language processing (NLP) has advanced significantly in recent years. However, class imbalance in emotion datasets remains a critical challenge: dominant emotion categories tend to overshadow less frequent ones, which biases model predictions. Traditional techniques, such as undersampling and oversampling, offer only partial solutions. More recently, synthetic data generation using large language models (LLMs) has emerged as a promising strategy for augmenting minority classes and improving model robustness. In this study, we investigate the impact of synthetic data augmentation on German-language emotion classification. Using an imbalanced dataset, we systematically evaluate multiple balancing strategies, including undersampling overrepresented classes and generating synthetic data for underrepresented emotions with a GPT-4-based model in a few-shot prompting setting. In addition to measuring model performance, we conduct a detailed linguistic analysis of the synthetic samples, examining their lexical diversity, syntactic structures, and semantic coherence to determine how they contribute to model generalization. Our results demonstrate that integrating synthetic data significantly improves classification performance, particularly for minority emotion categories, while maintaining overall model stability. However, our linguistic evaluation reveals that synthetic examples exhibit reduced lexical diversity and simplified syntactic structures, which may limit their usefulness in certain real-world applications. These findings highlight both the potential and the challenges of synthetic data augmentation in emotion classification. By providing a comprehensive evaluation of balancing techniques and of the linguistic properties of generated text, this study contributes to the ongoing discourse on improving NLP models for underrepresented linguistic phenomena.

Details

1009240
Title
Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis
Author
Publication title
Volume
16
Issue
4
First page
330
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
e-ISSN
2078-2489
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-04-21
Milestone dates
2025-03-14 (Received); 2025-04-16 (Accepted)
First posting date
21 Apr 2025
ProQuest document ID
3194615642
Document URL
https://www.proquest.com/scholarly-journals/evaluating-impact-synthetic-data-on-emotion/docview/3194615642/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-07
Database
ProQuest One Academic