Full text

Turn on search term navigation

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

This paper presents a novel approach for finding the most semantically similar conversational sentences in Korean and English. Our method involves training separate embedding models for each language and using a hybrid algorithm that selects the appropriate model based on the language of the query. For the Korean model, we fine-tuned the KLUE-RoBERTa-small model using publicly available semantic textual similarity datasets and used Principal Component Analysis (PCA) to reduce the resulting embedding vectors. We also selected a highly-performing English embedding model from available SBERT models. We compared our approach to existing multilingual models using both human-generated and large language model-generated conversational datasets. Our experimental results demonstrate that our hybrid approach outperforms state-of-the-art multilingual models in terms of accuracy, elapsed time for sentence embedding, and elapsed time for finding the nearest neighbor, regardless of whether a GPU is used. These findings highlight the potential benefits of training separate embedding models for different languages, particularly for tasks involving finding the most semantically similar conversational sentences. We expect that our approach will be used for diverse natural language processing-related fields, including machine learning education.

Details

Title
Using Multiple Monolingual Models for Efficiently Embedding Korean and English Conversational Sentences
Author
Park, Youngki 1   VIAFID ORCID Logo  ; Shin, Youhyun 2 

 Department of Computer Education, Chuncheon National University of Education, Chuncheon 24328, Republic of Korea; [email protected] 
 Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea 
First page
5771
Publication year
2023
Publication date
2023
Publisher
MDPI AG
e-ISSN
20763417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2812438689
Copyright
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.