Machine learning models for natural language processing (NLP) learn from datasets to make predictions on unseen data, and accurate models require large, high-quality, balanced datasets. Collecting such datasets is time-consuming and costly, especially for low-resource languages. Data augmentation addresses this by generating synthetic samples from existing data to increase the dataset size. This study examines the effect of translation-based data augmentation on sentiment analysis using small datasets in three diverse languages: French, German, and Japanese. We use two neural machine translation (NMT) services, Google Translate and DeepL, to generate augmented datasets through intermediate language translation. Sentiment analysis models based on Support Vector Machines (SVMs) are trained on both the original and the augmented datasets and evaluated using accuracy, precision, recall, and F1 score. Our results show that translation augmentation improves model performance for French and Japanese. For example, using Google Translate, model accuracy improved from 62.50% to 83.55% in Japanese (+21.05 percentage points) and from 87.66% to 90.26% in French (+2.60 percentage points). In contrast, the German dataset showed either a minor improvement or a decline, depending on the translator used. Google-based augmentation generally outperformed DeepL-based augmentation, which yielded smaller or negative gains. To evaluate cross-lingual generalization, models trained on one language were tested on datasets in the other two. Notably, a model trained on augmented German data improved its accuracy on French test data from 81.17% to 85.71% and on Japanese test data from 71.71% to 79.61%. Similarly, a model trained on augmented Japanese data improved accuracy on German test data by up to 3.4%. These findings indicate that translation-based augmentation can enhance sentiment classification and cross-language adaptability, particularly in low-resource and multilingual NLP settings.
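The augmentation and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Translator` signature, the function names, the pivot language, and the binary label scheme are all assumptions made for illustration, and `toy_translate` is a lookup-table stand-in for a real NMT service such as Google Translate or DeepL. SVM training is omitted; only the round-trip augmentation and the four reported metrics are shown.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]
Sample = Tuple[str, int]  # (text, sentiment label); binary labels are assumed


def augment_by_round_trip(
    samples: Sequence[Sample],
    translate: Translator,
    source_lang: str,
    pivot_lang: str,
) -> List[Sample]:
    """Return the original samples plus round-trip paraphrases.

    Each text is translated into the pivot (intermediate) language and back.
    A paraphrase is kept only if it is non-empty and differs from the source,
    and it inherits the label of the sample it came from.
    """
    augmented = list(samples)
    for text, label in samples:
        pivot_text = translate(text, source_lang, pivot_lang)
        back_text = translate(pivot_text, pivot_lang, source_lang)
        if back_text.strip() and back_text != text:
            augmented.append((back_text, label))
    return augmented


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for an NMT call; unknown inputs are returned unchanged."""
    table = {
        ("fr", "en"): {"très bon produit": "very good product"},
        ("en", "fr"): {"very good product": "produit excellent"},
    }
    return table.get((src, tgt), {}).get(text, text)


if __name__ == "__main__":
    data = [("très bon produit", 1), ("mauvais", 0)]
    for text, label in augment_by_round_trip(data, toy_translate, "fr", "en"):
        print(label, text)
    print(binary_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```

In this sketch a round trip that returns the input unchanged adds nothing, so the augmented set grows only where the translator actually produces a paraphrase. Reported gains such as 62.50% to 83.55% are differences in the `accuracy` value, expressed in percentage points.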
Details
French language;
Accuracy;
Datasets;
Machine learning;
Machine translation;
Market research;
Language translation;
Intermediate languages;
Data augmentation;
Japanese language;
Sentiment analysis;
Support vector machines;
Hypotheses;
Natural language processing;
Classification;
German language;
Multilingualism;
Linguistics;
Algorithms;
Translators;
Product development;
Prediction models;
Augmentation;
Data;
Languages;
Language acquisition;
Tests;
Translation
1 Department of Information Systems, Faculty of Information Technology and Computer Science, Yarmouk University, Irbid 21163, Jordan
2 Department of Computer Information Science, Minnesota State University, Mankato, MN 56001, USA