© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Data augmentation is crucial for improving the performance of text classification models when labelled training data are scarce. For natural language processing (NLP) tasks, large language models (LLMs) can generate high-quality augmented data, but the reasons for their effectiveness are not well understood. This paper presents a geometric and topological perspective on textual data augmentation using LLMs. We compare augmentation data generated by GPT-J with data generated through cosine similarity from Word2Vec and GloVe embeddings. Topological data analysis reveals that GPT-J-generated data maintain label coherence. Convex hull analysis of these data, represented by their two principal components, shows that they lie within the spatial boundaries of the original training data. Delaunay triangulation reveals that an increase in the number of augmented data points connected within these boundaries correlates with improved classification accuracy. These findings provide insight into the superior performance of LLMs in data augmentation. These techniques could form the basis of a framework for predicting the usefulness of augmentation data from their geometric properties.
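
The convex hull check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each sentence has already been embedded and projected onto two principal components, so every data point is a plain (x, y) pair. The function names (`convex_hull`, `inside_hull`, `fraction_inside`) and the toy coordinates are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would create a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False  # a degenerate hull encloses no area
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def fraction_inside(original: List[Point], augmented: List[Point]) -> float:
    """Share of augmented points that fall within the hull of the original points."""
    if not augmented:
        return 0.0
    hull = convex_hull(original)
    return sum(inside_hull(hull, p) for p in augmented) / len(augmented)


# Toy 2D projections (assumed values, not data from the paper).
original = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
augmented = [(1.0, 1.0), (3.0, 2.0), (5.0, 5.0), (2.0, 4.0)]
print(fraction_inside(original, augmented))
```

In this toy example, three of the four augmented points fall inside or on the hull of the original points. In practice, a library routine such as SciPy's `ConvexHull` or `Delaunay` would likely replace the hand-written hull, and the share of enclosed points is only one possible summary of how well augmented data stay within the boundaries of the training data.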

Details

Title
Geometry of Textual Data Augmentation: Insights from Large Language Models
Author
Feng, Sherry J H; Lai, Edmund M-K; Li, Weihua
First page
3781
Publication year
2024
Publication date
2024
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3110456820