Word sense disambiguation (WSD) plays a crucial role in natural language processing (NLP) tasks such as machine translation, sentiment analysis, and information retrieval. Korean makes WSD particularly challenging: its complex morphological structure and widespread polysemy mean that a word's sense shifts with context, and a single surface form can carry several distinct meanings that a model must tell apart to perform well. Recently, large pre-trained models such as BERT and GPT, applied through transfer learning, have shown promising results on this problem. However, for morphologically rich languages like Korean, a tokenization mismatch between the pre-trained model and the fine-tuning data prevents the rich contextual and lexical information learned during pre-training from being fully exploited in downstream tasks. This paper proposes a novel method that addresses this tokenization mismatch when fine-tuning for Korean WSD, leveraging BERT-based pre-trained models and the Sejong corpus, which was sense-annotated by language experts. Experiments with various BERT-based pre-trained models and datasets drawn from the Sejong corpus show that the proposed method improves performance by approximately 3–5% over existing approaches.
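To make the mismatch concrete: a BERT WordPiece tokenizer may split a sense-annotated Korean morpheme into several subwords, so the morpheme-level sense labels of the Sejong corpus no longer align one-to-one with the model's token positions. The sketch below illustrates one common remedy, not necessarily the paper's exact method: tracing each subword back to its source morpheme and mean-pooling the subword vectors into a single contextual feature per sense-tagged unit. The model name `klue/bert-base`, the example sentence, and the sense IDs are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of handling the subword/morpheme tokenization mismatch.
# Assumptions: a Hugging Face fast tokenizer, the klue/bert-base checkpoint,
# and Sejong-style homograph sense numbers on individual morphemes.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModel.from_pretrained("klue/bert-base")

# A sense-annotated example (illustrative data): "배" is polysemous in Korean
# (pear / boat / belly), so it carries a homograph sense ID; particles and
# endings are left unannotated (None).
morphemes = ["배", "를", "먹", "었다"]
sense_ids = [3, None, 1, None]

# Tokenize morpheme-by-morpheme so every subword can be traced to its morpheme.
enc = tokenizer(morphemes, is_split_into_words=True, return_tensors="pt")
word_ids = enc.word_ids()  # maps each subword position -> morpheme index (or None)

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (num_subwords, hidden_size)

# Mean-pool the subword vectors belonging to each sense-annotated morpheme,
# yielding one contextual vector per sense-tagged unit for a WSD classifier.
pooled = {}
for pos, widx in enumerate(word_ids):
    if widx is not None and sense_ids[widx] is not None:
        pooled.setdefault(widx, []).append(hidden[pos])
features = {i: torch.stack(vecs).mean(dim=0) for i, vecs in pooled.items()}
```

Mean-pooling is only one aggregation choice; selecting the first subword of each morpheme is an equally common alternative when recovering word-level representations from subword models.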
Details
Word sense disambiguation; Datasets; Information retrieval; Words (language); Context; Corpus linguistics; Machine translation; Polysemy; Machine learning; Performance enhancement; Sentiment analysis; Neural networks; Natural language processing; Semantic analysis; Multilingualism; Algorithms; Morphological complexity; Annotations; Korean language; Morphology; Semantics; Meaning; Data mining; Retrieval; Language shift; Languages
