© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

This study introduces two novel data reduction approaches for efficient sentiment analysis: High-Distance Sentiment Vectors (HDSV) and Centroid Sentiment Embedding Vectors (CSEV). By leveraging embedding space characteristics from DistilBERT, HDSV selects maximally separated sample pairs, while CSEV computes representative centroids for each sentiment class. We evaluate these methods on three benchmark datasets: SST-2, Yelp, and Sentiment140. Our results demonstrate remarkable data efficiency, reducing training samples to just 100 with HDSV and two with CSEV while maintaining comparable performance to full dataset training. Notable findings include CSEV achieving 88.93% accuracy on SST-2 (compared to 90.14% with full data) and both methods showing improved cross-dataset generalization, with less than 2% accuracy drop in domain transfer tasks versus 11.94% for full dataset training. The proposed methods enable significant storage savings, with datasets compressed to less than 1% of their original size, making them particularly valuable for resource-constrained environments. Our findings advance the understanding of data requirements in sentiment analysis, demonstrating that strategically selected minimal training data can achieve robust and generalizable classification while promoting more sustainable machine learning practices.
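The two selection ideas named in the abstract can be illustrated with a small sketch. This is not the author's implementation: it assumes binary labels (0 = negative, 1 = positive), Euclidean distance, a greedy pairing rule for the HDSV-style selection, and a nearest-centroid decision rule for the CSEV-style centroids, none of which the abstract specifies. Random vectors stand in for DistilBERT embeddings, and the function names are hypothetical.

```python
import numpy as np


def class_centroids(embeddings, labels):
    """CSEV-style reduction: one mean embedding per sentiment class."""
    return {int(c): embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}


def predict_nearest_centroid(centroids, queries):
    """Assign each query to the class of its closest centroid (assumed decision rule)."""
    classes = sorted(centroids)
    stacked = np.stack([centroids[c] for c in classes])  # (n_classes, dim)
    dists = np.linalg.norm(queries[:, None, :] - stacked[None, :, :], axis=2)
    return np.array(classes)[dists.argmin(axis=1)]


def high_distance_pairs(embeddings, labels, n_pairs):
    """HDSV-style selection (assumed greedy variant).

    Returns up to n_pairs (negative_index, positive_index) tuples, taken in
    order of decreasing Euclidean distance, using each sample at most once.
    """
    neg = np.flatnonzero(labels == 0)
    pos = np.flatnonzero(labels == 1)
    # Pairwise distances between every negative and every positive sample.
    dists = np.linalg.norm(
        embeddings[neg][:, None, :] - embeddings[pos][None, :, :], axis=2
    )
    order = np.argsort(dists, axis=None)[::-1]  # flat indices, largest distance first
    used_neg, used_pos, pairs = set(), set(), []
    for flat in order:
        i, j = np.unravel_index(flat, dists.shape)
        i, j = int(i), int(j)
        if i in used_neg or j in used_pos:
            continue  # each sample may appear in only one pair
        used_neg.add(i)
        used_pos.add(j)
        pairs.append((int(neg[i]), int(pos[j])))
        if len(pairs) == n_pairs:
            break
    return pairs


# Synthetic stand-in for sentence embeddings: two well-separated clusters.
rng = np.random.default_rng(0)
demo_embeddings = np.vstack(
    [rng.normal(-2.0, 0.5, (20, 8)), rng.normal(2.0, 0.5, (20, 8))]
)
demo_labels = np.array([0] * 20 + [1] * 20)

demo_centroids = class_centroids(demo_embeddings, demo_labels)  # two vectors in total
demo_pairs = high_distance_pairs(demo_embeddings, demo_labels, n_pairs=5)
demo_predictions = predict_nearest_centroid(demo_centroids, demo_embeddings)
```

In this sketch the centroid reduction keeps only one vector per class, matching the "two samples" figure quoted for CSEV, while the pair selection keeps the samples that lie farthest apart across the two classes. How the retained vectors are then used to train a classifier is not stated in the abstract, so the nearest-centroid rule here is only one plausible choice.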

Details

Title
Efficient Data Reduction Through Maximum-Separation Vector Selection and Centroid Embedding Representation
Author
Alshamrani Sultan
First page
1919
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3211937607