Developing a machine translator from a Korean dialect to a foreign language is challenging because no parallel corpus exists for direct dialect translation. To address this issue, this paper proposes a pivot-based machine translation model that consists of two sub-translators. The first sub-translator is a sequence-to-sequence model with minGRU as an encoder and GRU as a decoder; it normalizes a dialect sentence into a standard sentence and employs alphabet-level tokenization. The second sub-translator is a legacy translator, such as an off-the-shelf neural machine translator or an LLM, which translates the normalized standard sentence into a foreign sentence. The effectiveness of the alphabet-level tokenization and the minGRU encoder for the normalization model is demonstrated through empirical analysis. Alphabet-level tokenization is shown to be more effective for Korean dialect normalization than other widely used sub-word tokenizations. The minGRU encoder exhibits performance comparable to a GRU encoder while being faster and more effective at managing longer token sequences. The pivot-based translation method is also validated through a broad range of experiments, and its effectiveness in translating Korean dialects to English, Chinese, and Japanese is demonstrated empirically.
1. Introduction
Most languages have their own dialects arising from social, ethnic, and regional causes. The Korean language has eight major dialects according to the work of Choi [1]: Gyeongi, Gangwon, Chungcheong, Jeolla, Gyeongsang, Jeju, Pyeongan, and Hamgyeong, where the Gyeongi dialect is regarded as standard Korean. The main characteristic of Korean dialects is that they are geographically specialized, yet the region in which each dialect is spoken is not large. Thus, all the dialects share the same grammar, while every dialect has its own specialized words and some morphological transitions.
The specialized words in a dialect make it difficult to directly translate the dialect into a foreign language. Figure 1 shows such an example. The standard sentence “이 나이에도 여전히 걱정만 끼쳐 죄송해요” is translated correctly as “I’m sorry I still worry you even at this age,” whereas its Gangwon dialect counterpart is mistranslated as “I’m sorry that I still cause extreme pain even in this situation.” The dialect-specific words fall outside the vocabulary of the translator, which leads to an out-of-vocabulary (OOV) problem.
The aforementioned OOV issue in managing Korean dialects is mainly caused by modern sub-word tokenizers that are naively applied to Korean at the syllable level. This is where the first research question arises: ‘Are there any better options for normalizing Korean dialects than simply applying syllable-level sub-word tokenizers and a transformer architecture?’ We found a fundamental insight: Korean dialects exhibit unique phonetic and morphological characteristics that are not adequately captured by syllable-level tokenization. For instance, consider the verb ‘카다 (do)’ in the Gyeongsang dialect, pronounced ka-da. The only difference between this verb and its standard form ‘하다’, pronounced ha-da, is the first syllable ‘ka’ instead of ‘ha’. Further examples are given in Table 1. Despite such minimal differences, the two words are represented completely differently by a sub-word tokenizer. According to Li et al. [2], character-level modeling is better than sub-word modeling for agglutinative languages, especially when a translator is trained on a small dataset. Since Korean is an agglutinative language and its alphabet shares many features with English characters, the proposed normalization model operates at the level of the Korean alphabet, Jamo. This is the answer to our first research question.
Then, the second research question of this paper arises: ‘Is there any way to apply a novel dialect normalization method to more practical fields?’ We found the answer: ‘by translating Korean dialects to foreign languages’. Korean dialects are spoken rather than written, but the need for translating written dialects is increasing. Commercial speech-to-text and speech-to-speech translators are now emerging in the translation field, and in such systems spoken dialects are transcribed into text before translation. The current challenge is the lack of data available to train a machine translator from a Korean dialect to a foreign language; that is, there is no publicly available parallel corpus for Korean dialect translation. Thus, the proposed model employs a pivot-based approach that adopts standard Korean as a pivot language. The pivot-based approach is a common choice for low-resource languages [3,4,5]. The normalization from a dialect sentence to a standard sentence is achieved by the proposed alphabet-level translation model. Once a dialect sentence is normalized to a standard sentence, the standard sentence can be translated to a foreign sentence by a legacy translator such as an off-the-shelf neural machine translator or a large language model.
The contributions of this paper can be summarized as follows. This is the first work, to the best of our knowledge, to translate a Korean dialect to a foreign language with a pivot-based translation model, and the experimental results show that the proposed pivot-based translation outperforms direct translation. Alphabet-level tokenization is used to normalize dialect sentences, and its superiority over other sub-word tokenizations is shown empirically. The proposed sequence-to-sequence model for dialect normalization adopts minGRU as an encoder and GRU as a decoder, and this paper shows that a minGRU encoder is a viable alternative to a GRU encoder, since minGRU is faster and more effective at managing longer token sequences.
The rest of this paper is organized as follows. Section 2 surveys previous works on dialects and their translations. Section 3 explains how the proposed pivot-based Korean dialect translator is structured, and Section 4 describes normalizing a dialect sentence to a standard form in detail. Section 5 presents the evaluation results, and finally Section 6 draws conclusions of this work.
2. Related Work
Dialect translations are clustered into two categories: inner dialect translation and dialect-foreign translation. Inner dialect translation targets translation between dialects of a single language. It includes the translation of a non-standard dialect into a standard one, which is known as dialect normalization [6]. On the other hand, dialect-foreign translation focuses on translation between a dialect and a foreign language.
One critical issue in both types of dialect translation is coping with lexical variations. Tan et al. [7] proposed Base-Inflectional Encoding (BITE), which can be applied to any pre-trained language model with ease. It leverages inflectional features of English and is thus robust even for non-standard English. Abe et al. [8] tried to capture consistent phonetic transformation rules shared by various Japanese dialects; to this end, they used a multilingual NMT [9] to translate a dialect into standard Japanese. On the other hand, Honnet et al. [10] and Sajjad et al. [11] empirically showed that character-level processing is effective in managing variations of Swiss German and Egyptian Arabic, respectively.
Another critical issue in dialect translation is the lack of resources for training translation models. Faheem et al. [12] applied a semi-supervised approach to normalize Egyptian Arabic to Standard Arabic in order to overcome the lack of training data. On the other hand, Liu et al. [13] prepared a dataset for direct dialect translation by creating a parallel corpus from Singlish to Standard English. This corpus stresses lexical-level normalization, syntactic-level editing, and semantic-level rewriting. When a machine translator is trained with limited data, it is prone to be excessively affected by noise or superficial lexical features; therefore, input perturbations were adopted at both the word and sentence levels.
In dialect-foreign translation, pivot-based translation is a common approach to circumvent the low-resource problem [14,15]. In this approach, a dialect is first translated into a standard form, and then the standard form is translated into a foreign language. In addition, back translation is often adopted to address the issue of limited dialect data [16,17]. For instance, Tahssin et al. [18] applied back translation to overcome data imbalance. At the same time, there have been efforts to construct data for the direct translation of dialects. Riley et al. [19] presented a benchmark for few-shot region-aware machine translation that includes language pairs of English and two regional dialects each of Portuguese and Mandarin Chinese. On the other hand, Sun et al. [20] proposed a translation evaluation method that is robust to dialects.
Recent studies on the translation of Korean regional dialects have mainly focused on exploring and improving existing neural machine translation methods. Lim et al. [21] adopted a transformer-based architecture and a syllable-level SentencePiece tokenizer for Korean dialect translation and confirmed the effectiveness of the copy mechanism and the many-to-one translation approach. Hwang and Yang [22] took a pre-training and fine-tuning approach to Korean dialect normalization, fine-tuning a BART variant with standard BPE tokenization and regional information tokens. Similarly, Lee et al. [23] demonstrated the potential of large language models (LLMs) as translators for the Jeju dialect.
Korean dialects are used more frequently in speech than in writing. Therefore, research on dialect speech recognition is essential, yet this topic has been addressed by only a few studies. Roh and Lee [24] presented an early exploration of this topic, investigating how commercial APIs could be used to recognize Korean dialects. According to their study, the Google Speech Recognition API is more accurate than other APIs. However, despite its high accuracy, challenges remain in recognizing dialects due to the unique phonetic and lexical characteristics of the Korean language. Na et al. [25] offered insight into how off-the-shelf ASR systems can be adapted for dialect recognition. Their experimental results showed that modern ASR systems such as Whisper and wav2vec 2.0 perform well in recognizing Korean dialects. More recently, Bak et al. [26] improved the performance of Whisper’s dialect recognition by refining its results with GPT-4o-mini; transcription errors made by Whisper were corrected by applying RAG to the GPT-4o-mini language model.
3. Pivot-Based Translation for Korean Dialects
Due to the lack of a parallel corpus between Korean dialects and foreign languages, it is extremely difficult to construct a direct machine translator for Korean dialects. Thus, this paper adopts standard Korean as a pivot language between a dialect and a foreign language. That is, the proposed translator first translates a dialect sentence into a standard Korean sentence and then translates that standard sentence into a foreign language.
Figure 2 depicts the overall structure of the proposed pivot-based translator for Korean dialects. It consists of two sub-translators. A dialect sentence is first normalized to a standard sentence by the GRU-based sequence-to-sequence model explained below. For instance, legacy machine translators do not understand the Jeju dialect sentence “목사님 그 앞에 모니터 좀 있으면 좋지 않안허쿠가 영 하니까,” which is translated incorrectly into “The pastor, it would be nice to have a monitor in front of him because he’s so young.” However, they accept its standard form “목사님 그 앞에 모니터 좀 있으면 좋지 않겠어요 이렇게 하니까,” of which the meaning is “Pastor, wouldn’t it be better if there is a monitor in front of the stuff like this?”
Once a standard sentence has been obtained, it can be translated into a foreign language by a legacy machine translator. Thanks to the large volume of parallel corpora between standard Korean and major foreign languages, a number of machine translators, including LLMs, show high and reliable performance. This paper leverages neural machine translators provided by easyNMT, a Python library that gives a unified interface to various translation models, as well as an LLM (see Section 5.1).
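To make the two-stage structure concrete, the following is a minimal sketch of the pipeline. The `normalize_dialect` function is a placeholder for the normalization model of Section 4, and the EasyNMT calls follow its documented interface; the model name and target language are illustrative choices rather than the exact configuration of the experiments.

```python
# Minimal sketch of the pivot-based pipeline (illustrative, not the exact setup).
from easynmt import EasyNMT


def normalize_dialect(sentence: str) -> str:
    """Dialect -> standard Korean; stub for the Section 4 normalization model."""
    raise NotImplementedError


def pivot_translate(dialect_sentence: str, target_lang: str = "en") -> str:
    # Stage 1: normalize the dialect sentence into standard Korean.
    standard = normalize_dialect(dialect_sentence)
    # Stage 2: translate the standard sentence with a legacy translator.
    translator = EasyNMT("opus-mt")  # or "m2m_100_1.2B"
    return translator.translate(standard, source_lang="ko", target_lang=target_lang)
```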
4. Normalization Model from Dialect to Standard
4.1. Tokenization
There are several options for tokenizing a Korean text. Four widely used tokenizations are syllable-level SentencePiece, byte-level BPE, morpheme-level, and alphabet-level. The tokens produced by these tokenizations for the example dialect sentence used in Figure 2 are shown in Table 2. Korean is written in syllable blocks, so a syllable can be a natural tokenization unit. However, since a syllable is a combination of several base alphabets in Korean, the number of possible syllables is extremely large. As a result, a syllable-level tokenizer such as syllable-level SentencePiece, when trained on an insufficient corpus, can yield poor performance. Byte-level BPE is effective in that it does not suffer from the OOV problem, as it processes Korean text at the byte level. However, it generates illegible output that is far from the original sentence, as shown in the table, and is therefore not intuitive.
Another option is to leverage morphemes as the unit of tokenization. This is standard practice in both rule-based and statistics-based machine translation. Since a morpheme is the smallest unit of meaning, this tokenization can preserve the meaning of each word. However, it depends on the performance of a morphological analyzer and also suffers from the OOV problem.
The last option for Korean tokenization is to use the Korean alphabet, Jamo. This paper proposes alphabet-level tokenization for dialect normalization. The proposed tokenizer decomposes every syllable into its constituent alphabets, and its vocabulary consists of the Korean alphabets, alpha-numeric symbols, and two special tokens.
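As an illustration of how alphabet-level tokens can be produced, the sketch below decomposes precomposed Hangul syllables into Jamo using the standard Unicode decomposition formula. The treatment of spaces, the two special tokens, and composite final consonants in the actual tokenizer may differ; this is an assumed, minimal version.

```python
# Minimal sketch of Jamo (alphabet)-level tokenization via Unicode arithmetic.
def to_jamo_tokens(text: str) -> list[str]:
    tokens = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:              # precomposed Hangul syllable
            idx = code - 0xAC00
            lead, vowel, tail = idx // 588, (idx % 588) // 28, idx % 28
            tokens.append(chr(0x1100 + lead))      # initial consonant
            tokens.append(chr(0x1161 + vowel))     # medial vowel
            if tail:
                tokens.append(chr(0x11A7 + tail))  # final consonant, if present
        else:
            tokens.append(ch)                      # digits, symbols, spaces, etc.
    return tokens


# e.g., to_jamo_tokens("하니까") yields the six conjoining Jamo of ㅎㅏ ㄴㅣ ㄲㅏ.
```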
4.2. Normalization Model
A Korean dialect sentence and its standard sentence share most of their alphabet sequences. Therefore, the Gated Recurrent Unit (GRU) model proposed by Cho et al. [28] can be used for dialect normalization. It has shown reasonable performance in many NLP tasks and is more efficient than LSTM since it has fewer parameters.
The proposed model for dialect normalization is depicted in Figure 3. It is a GRU-based sequence-to-sequence model. However, its encoder is minGRU [29] rather than GRU, since GRU suffers from slow training due to back-propagation through time and alphabet-level token sequences are generally quite long, as shown in Table 2. On the other hand, its decoder is an original GRU, because the decoder aims at the precise generation of a target sentence in an auto-regressive way.
Assume that a natural language dialect sentence is given. If the $t$-th alphabet-level token is expressed as a vector embedding $x_t$, then the sentence is represented as a matrix $X = [x_1; x_2; \ldots; x_T]$ by concatenating the $x_t$ values. The encoder is a bi-directional multi-layer minGRU. That is, it consists of $L$ minGRU layers. In the $l$-th layer ($1 \le l \le L$), a minGRU transforms its input $x_t^{(l)}$ into a hidden state vector $h_t^{(l)}$. In order to speed up training, minGRU removes the hidden-state dependencies of GRU and reduces its hyperbolic functions. That is, when $\odot$ is a point-wise multiplication, $h_t^{(l)}$ is obtained by

$$h_t^{(l)} = (1 - z_t^{(l)}) \odot h_{t-1}^{(l)} + z_t^{(l)} \odot \tilde{h}_t^{(l)}, \tag{1}$$

where

$$z_t^{(l)} = \sigma\bigl(\mathrm{Linear}_{d}(x_t^{(l)})\bigr), \tag{2}$$

$$\tilde{h}_t^{(l)} = \mathrm{Linear}_{d}(x_t^{(l)}). \tag{3}$$

Here, $\sigma$ and $\mathrm{Linear}_{d}$ denote a sigmoid activation function and a $d$-dimensional linear transformation, respectively. Note that there is no reset gate in these equations. In addition, the forget gate of GRU, represented as $\sigma(\mathrm{Linear}_{d}(x_t^{(l)} \oplus h_{t-1}^{(l)}))$, is replaced with $z_t^{(l)}$. Compared with the GRU gate, $z_t^{(l)}$ has no dependency on the previous hidden state $h_{t-1}^{(l)}$. Similarly, $\tilde{h}_t^{(l)}$ also does not depend on $h_{t-1}^{(l)}$. As a result, $z_t^{(l)}$ and $\tilde{h}_t^{(l)}$ can be processed in parallel for all $t$ values. After $z_t^{(l)}$ and $\tilde{h}_t^{(l)}$ are computed for all $t$ values, the hidden state $h_t^{(l)}$ in Equation (1) is obtained in parallel using the Parallel Scan algorithm [30,31]. The final output of the $l$-th layer becomes $H^{(l)} = [h_1^{(l)}; h_2^{(l)}; \ldots; h_T^{(l)}]$. Equations (2) and (3) depend on $x_t^{(l)}$, which is the hidden state of the $(l-1)$-th layer. Thus, computing $H^{(l)}$ can be understood as $H^{(l)} = \mathrm{minGRU}(H^{(l-1)})$, where $H^{(0)} = X$. In order to reflect bi-directional contexts in the hidden state, both the forward state $\overrightarrow{h}_t^{(l)}$ and the backward state $\overleftarrow{h}_t^{(l)}$ are used. That is, $H^{(l)} = [h_1^{(l)}; \ldots; h_T^{(l)}]$ is used as the final hidden state matrix of the layer, where $h_t^{(l)} = \overrightarrow{h}_t^{(l)} \oplus \overleftarrow{h}_t^{(l)}$ and $\oplus$ is a concatenation of two vectors. Then, the final output of the encoder is $H^{(L)}$, which is the hidden state matrix of the $L$-th bi-directional minGRU layer.
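The following is a minimal PyTorch sketch of a single minGRU layer implementing Equations (1)–(3) with a sequential loop. The actual encoder evaluates the same recurrence with the Parallel Scan algorithm and stacks bi-directional layers; those parts are omitted here for brevity, so this is an illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn


class MinGRULayer(nn.Module):
    """Sequential sketch of Equations (1)-(3); the paper's encoder computes
    the same recurrence in parallel with a prefix scan over time."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.to_z = nn.Linear(input_dim, hidden_dim)        # gate, Eq. (2)
        self.to_h_tilde = nn.Linear(input_dim, hidden_dim)  # candidate, Eq. (3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim)
        batch, T, _ = x.shape
        z = torch.sigmoid(self.to_z(x))   # no dependence on h_{t-1}
        h_tilde = self.to_h_tilde(x)      # no tanh, no reset gate
        h = x.new_zeros(batch, self.to_z.out_features)
        outputs = []
        for t in range(T):                # Eq. (1): convex combination over time
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)  # (batch, time, hidden_dim)
```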
The decoder of the proposed translator is a GRU which generates the output tokens $y_1, y_2, \ldots$ auto-regressively. It applies the dot-product attention [32] between $s_t$, the hidden state of the decoder, and $H^{(L)} = [h_1^{(L)}; \ldots; h_T^{(L)}]$, the last hidden state matrix of the encoder, to generate an output. That is, in generating $y_t$, the attention score $\alpha_{t,i}$ is first calculated by

$$\alpha_{t,i} = \frac{\exp\bigl(s_t^{\top} h_i^{(L)}\bigr)}{\sum_{j=1}^{T} \exp\bigl(s_t^{\top} h_j^{(L)}\bigr)}.$$

Then, the context vector $c_t$ becomes

$$c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i^{(L)}.$$

Since the decoder is a GRU, the hidden state vector of the decoder becomes

$$s_t = \mathrm{GRU}\bigl(s_{t-1}, e(y_{t-1})\bigr), \tag{4}$$

where $e(y_{t-1})$ is an embedding of $y_{t-1}$. Finally, when $V$ is a vocabulary, $y_t$, the $t$-th output, is generated by

$$y_t = \mathrm{softmax}\bigl(\mathrm{Linear}_{|V|}(s_t \oplus c_t)\bigr).$$

In these equations, $y_t$ is a vector whose dimension is $|V|$, and it is converted into an output token through a one-hot operation and vocabulary lookup. In Figure 3, ‘ㅗ’ is generated through this process, and its embedding, $e(y_t)$, is fed to the generation of $y_{t+1}$.
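A minimal sketch of one decoding step is given below, combining a GRU cell with dot-product attention and an output projection over the vocabulary. The tensor shapes and the exact way the context vector is combined with the decoder state are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AttentiveGRUDecoderStep(nn.Module):
    """One decoding step: GRU update, dot-product attention over the encoder
    states, and a projection onto the Jamo vocabulary (sketch of Section 4.2)."""

    def __init__(self, emb_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim, hidden_dim)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, y_prev_emb, s_prev, enc_states):
        # y_prev_emb: (batch, emb_dim), s_prev: (batch, hidden),
        # enc_states: (batch, time, hidden) = H^(L)
        s_t = self.cell(y_prev_emb, s_prev)                    # Eq. (4)
        scores = torch.bmm(enc_states, s_t.unsqueeze(2))       # dot-product attention
        alpha = torch.softmax(scores.squeeze(2), dim=1)
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
        logits = self.out(torch.cat([s_t, c_t], dim=1))        # distribution over V
        return logits, s_t
```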
5. Experiments
5.1. Experimental Settings
The Korean Dialect Speech Dataset released on the Korea AI Hub is used in the experiments. It provides pairs of dialect sentences and their standard Korean counterparts for the five regional dialects listed in Table 3.
Since the sentences in the dataset are spoken dialects, they are pre-processed to remove noise such as stutters and laughter. In addition, sentences that are too short, with fewer than four eojeols, are removed; an eojeol is a spacing unit in Korean. The original dataset also contains a large portion of pairs in which the standard and dialect sentences are exactly the same. This is because the dataset comprises spoken dialogues: if speakers conversed in standard Korean, the standard and dialect forms are labeled identically. Thus, all such cases are filtered out of the dataset, as sketched below. Simple statistics of the final pre-processed dataset are provided in Table 3, where the numbers in parentheses indicate the numbers of original pairs. The ratio of retained pairs to original pairs in the training set varies considerably, from 12% to 35%. Nevertheless, there are enough pairs to train and test a model for normalizing Korean dialects.
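The sketch below illustrates the two filters described above, assuming each record is a (dialect, standard) sentence pair; the removal of stutters and laughter depends on the dataset's annotation scheme and is omitted here.

```python
# Minimal sketch of the pre-processing filters (illustrative).
def filter_pairs(pairs):
    kept = []
    for dialect, standard in pairs:
        if len(dialect.split()) < 4:   # fewer than four eojeols (space units)
            continue
        if dialect == standard:        # speaker used standard Korean
            continue
        kept.append((dialect, standard))
    return kept
```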
Four kinds of tokenization are evaluated for normalizing Korean dialects: syllable-level SentencePiece, byte-level BPE, morpheme-level, and alphabet-level. SentencePiece tokenization is implemented with the SentencePiece module from GitHub (v. 0.2.0), and BPE is implemented with the ByteLevelBPETokenizer class from the tokenizers module of HuggingFace. Korean morphemes are analyzed by the MeCab-ko morphological analyzer.
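The two sub-word tokenizers can be prepared roughly as follows. Here, "corpus.txt" is a hypothetical text file of training sentences, and the vocabulary size follows the 30,000 mentioned in Section 5.2; other settings are left at library defaults rather than matching the exact experimental configuration.

```python
# Sketch of training the sub-word tokenizers compared in the experiments.
import sentencepiece as spm
from tokenizers import ByteLevelBPETokenizer

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp_syllable", vocab_size=30000
)
sp = spm.SentencePieceProcessor(model_file="sp_syllable.model")

bpe = ByteLevelBPETokenizer()
bpe.train(files=["corpus.txt"], vocab_size=30000)

print(sp.encode("목사님 그 앞에 모니터 좀 있으면", out_type=str))
print(bpe.encode("목사님 그 앞에 모니터 좀 있으면").tokens)
```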
The normalization models are trained to optimize the cross-entropy loss with the Adam optimizer and the ReduceLROnPlateau scheduler. The batch size was set to 200 for all tokenizations and dialects except for morpheme-level tokens; as the vocabulary of morpheme-level tokens is larger than the others, their batch size was set to 64. The learning rate was initialized at 5 × and the weight decay was set to 1 × . Additionally, d, the dimension of the hidden state vectors of the encoder in Equations (2) and (3), is set to 128; thus, that of the decoder is 256. The number of layers, L, is set to three.
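A sketch of this optimization setup is shown below. The module as well as the learning-rate and weight-decay values are illustrative stand-ins rather than the exact configuration of the experiments.

```python
import torch

# Stand-in module; in the experiments this is the minGRU/GRU seq2seq model.
model = torch.nn.GRU(input_size=128, hidden_size=128, num_layers=3)

criterion = torch.nn.CrossEntropyLoss()
# lr and weight_decay below are placeholders, not the exact values used.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for epoch in range(10):
    val_loss = torch.rand(1).item()   # placeholder for the real validation loss
    scheduler.step(val_loss)          # lower the LR when validation loss plateaus
```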
The translation results are evaluated with chrF++, BLEU, and BERT score. Among these metrics, BLEU and chrF++ are used to assess the normalization results. Note that phonemic transitions are predominantly observed in Korean dialects, so the primary task of the normalization models is to reconstruct phonemic variations and restore the standard form. This is why n-gram-based metrics are useful for evaluating dialect normalization models; in particular, chrF++ is designed for character-level evaluation, while BLEU focuses on the word level. On the other hand, the results of pivot translation are evaluated with chrF++ and BERT score. Since this task is an ordinary translation task, the BERT score is adopted to evaluate the semantic similarity between reference and translated sentences.
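These metrics can be computed, for example, with the sacrebleu and bert-score packages as in the sketch below; the hypothesis and reference sentences are illustrative, and the exact evaluation scripts of the experiments may differ.

```python
# Sketch of metric computation with sacrebleu and bert-score.
import sacrebleu
from bert_score import score as bert_score

hyps = ["I am sorry I still worry you even at this age."]
refs = ["I'm sorry I still worry you even at this age."]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrfpp = sacrebleu.CHRF(word_order=2).corpus_score(hyps, [refs])  # word_order=0 gives plain chrF
P, R, F1 = bert_score(hyps, refs, lang="en")

print(bleu.score, chrfpp.score, F1.mean().item())
```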
The translation models used to translate normalized dialects into foreign languages are (i) Opus-MT [33], (ii) m2m_100_1.2B [34], and (iii) EXAONE-3.0-7.8B-Instruct [27]. Llama-3.1-8B-Instruct is used to generate reference translations for the pivot translation experiments; it is adopted because it is one of the most popular publicly available LLMs. Opus-MT and m2m_100_1.2B are neural machine translators. Opus-MT is built on Marian NMT [35] and is trained on OPUS datasets [36]. On the other hand, m2m_100_1.2B is a many-to-many translation model that can translate between any pair of the one hundred languages it has been trained on. Their trained checkpoints are loaded and used via easyNMT, a Python library (v. 2.0.2) for neural machine translation which provides a unified interface to various translation models. These two neural translation models were selected as baselines because they are effective and can be easily adapted for practical use through easyNMT. Exaone is a large language model (LLM) developed by LG AI Research. According to Sim et al. [37], it is specialized in understanding Korean culture, which suggests that it is better suited to processing Korean dialects; this is the core reason why Exaone was adopted for this experiment. The LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct checkpoint from HuggingFace is used in the experiments. Translations into English, Japanese, and Chinese are evaluated for m2m_100_1.2B and Exaone, whereas only the English translation is reported for Opus-MT because the only available Opus-MT checkpoint is from Korean to English.
Two LLMs, Llama-3.1-8B-Instruct and Exaone-3.0-7.8B-Instruct, are used in the experiments. The prompts for each model are designed to elicit the desired translation outcomes, and examples are shown in Figure 4. The prompts are given in a zero-shot setting with no additional fine-tuning. They commonly include “다음 문장을 영어로 번역해줘 (Please translate the following sentence into English)”; for other languages, the term ‘영어 (English)’ is replaced with ‘일본어 (Japanese)’ or ‘중국어 (Chinese)’. The prompts are written in Korean for both models, forcing the language model to focus more on the Korean translation task. The output instruction, however, takes two forms: “You should generate only the translated text” for Llama-3.1-8B-Instruct and “번역한 문장만 출력하도록 해 (Please output only the translated sentence)” for Exaone-3.0-7.8B-Instruct. This is because Llama-3.1-8B-Instruct was primarily trained on English data, whereas Exaone-3.0-7.8B-Instruct is specialized for Korean.
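The zero-shot prompting can be reproduced roughly as follows with the HuggingFace transformers library. The chat-template usage and generation arguments are assumptions for illustration and need not match the exact inference setup of the experiments.

```python
# Sketch of zero-shot prompting with the Exaone checkpoint named above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

sentence = "목사님 그 앞에 모니터 좀 있으면 좋지 않겠어요 이렇게 하니까"
prompt = f"다음 문장을 영어로 번역해줘. 번역한 문장만 출력하도록 해.\n{sentence}"

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```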
Normalization experiments were performed three times with randomly initialized weights, and the result tables report the average and standard deviation of the evaluation metrics across these three runs; this is crucial for fairly evaluating each tokenization method. Conversely, pivot translation experiments are conducted once for each translation model, because the translation models are pre-trained and their weights are fixed. The normalization model used in the pivot translation experiments is the model that performed best in the normalization experiments.
5.2. Evaluations on Normalization from Dialect to Standard
Table 4 and Table 5 compare the performance of tokenization methods when normalizing Korean dialects. According to these tables, alphabet-level tokenization outperforms all other methods. Its average chrF++ score is over 90, implying that it restores almost perfect standard sentences from dialect sentences. Alphabet-level tokenization has a statistically significant advantage over other tokenization methods except for the Jeju dialect. A similar tendency is observed when BLEU is used as an evaluation metric. This is because the Korean dialects share their grammar and most words with standard Korean, as shown in Table 2.
One thing to note about these tables is that sub-word tokenizations are ineffective for this task; they are heavily influenced by the initial weights of the model, and their standard deviations are higher than those of alphabet-level tokenization. SentencePiece and BPE are more complex and restrictive than the others: they model the language within a pre-defined vocabulary size and require a larger corpus to capture all the patterns necessary for the task. The vocabulary size in sub-word tokenization is set to 30,000, which is smaller than that of morpheme-level tokenization, and this difference in vocabulary size is clearly reflected in the performance gap between sub-word tokenization and morpheme-level tokenization.
Another thing to note is that the performance on the Jeju dialect is consistently lower than that on other dialects, whichever tokenization is used. That is, even though the Jeju dialect has the largest number of training instances, its performance is the worst. This is due to the geographical characteristics of the Jeju area: Jeju is an isolated island located far south of the Korean mainland, so the Jeju dialect differs significantly from standard Korean in its surface form, resulting in poor normalization performance. This is also the reason why the sub-word tokenizers perform relatively better with the Jeju dialect than with other dialects. No statistically significant difference was observed between sub-word tokenization and alphabet-level tokenization for the Jeju dialect.
The proposed normalization model with alphabet-level tokenization is much smaller than the models with sub-word tokenizers. The model with alphabet-level tokenization has 1.2M trainable parameters for 156 Korean alphabets, numbers, and symbols, whereas the model with syllable-level SentencePiece has 21M parameters. That is, alphabet-level tokenization achieves higher performance with far fewer parameters.
The proposed model adopts minGRU as its encoder because alphabet-level tokenization makes input sequences longer. However, GRU can also be used as an encoder. According to Table 6, the normalization performance of a bidirectional GRU is slightly better than that of a bidirectional minGRU, but the difference is not statistically significant. The true advantage of minGRU lies in its efficiency during training and inference; that is, its execution time is shorter than that of GRU. Figure 5 depicts how much faster minGRU is than GRU during training. MinGRU takes less time per epoch than GRU for all types of dialect. Overall, using a minGRU encoder saves about 15% of the epoch time, even though the minGRU encoder consists of three minGRU layers while the GRU encoder has only one GRU layer. The three-layered minGRU and the one-layered GRU are compared because they demonstrate similar performance with character-level tokenization.
Figure 6 compares minGRU and GRU in terms of the GPU processing time used for normalizing dialects to the standard. In this figure, the X-axis is the number of tokens in a dialect sentence, and the Y-axis is the GPU time (msec) consumed to normalize the sentence. The figure shows that minGRU always consumes less GPU time than GRU. The difference between minGRU and GRU is not significant when the sentence length is less than 400 tokens; however, the longer a dialect sentence is, the larger the time gain of minGRU.
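Per-sentence GPU time can be measured, for instance, with CUDA events as in the sketch below; `encoder` and `tokens` stand for the trained encoder and a tokenized input batch already on the GPU, and the actual measurement procedure of the experiments may differ.

```python
# Sketch of per-sentence GPU timing with CUDA events.
import torch


def gpu_time_ms(encoder, tokens):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    with torch.no_grad():
        encoder(tokens)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)   # milliseconds
```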
5.3. Evaluations on Pivot Translation
There is no parallel corpus between Korean dialects and foreign languages. Thus, a parallel corpus has been constructed from the normalization dataset; note that the normalization dataset contains a standard Korean sentence for each dialect sentence. The standard sentences of the test set are first translated into foreign languages by an LLM, Llama-3.1-8B-Instruct, under the assumption that the sentences translated in this way are correct. Although this model has limited language modeling capability, it was chosen due to resource constraints. Then, three translation models—m2m_100_1.2B, Opus-MT, and Exaone—are used to prepare pairs of a dialect sentence and its translated foreign equivalent, as well as pairs of a normalized sentence and its translated foreign equivalent.
Table 7 shows the evaluation results of the proposed pivot-based translation model for English. Here, ‘Direct Translation’ means that the dialects are translated directly into English without using a standard pivot. This table reports the chrF++ and BERT scores for dialect translations. The proposed model achieves better performance than direct translation for all dialects, demonstrating the effectiveness of using standard Korean as a pivot language. Although the improvement is modest for both chrF++ and BERT score, it is still meaningful. The chrF++ score reflects surface-form similarity, but this is not the only factor that determines translation quality. The BERT score measures the semantic similarity between two sentences. However, according to Hanna and Bojar [38], the BERT score often assigns a high score even to incorrect translations. In their experiment, sentences containing several grammatical errors achieved a BERT score of around 82, while grammatically correct sentences achieved a score of around 83. That is, although the BERT score assigns high absolute values to flawed translations, it still penalizes them relative to correct ones, so an improvement of even less than 1.0 points implies a certain amount of improvement at the semantic level. In summary, the table shows that the proposed pivot-based translation is better in terms of both surface form and semantics.
Table 8 and Table 9 demonstrate performance when translating dialects into Chinese and Japanese, respectively. Since neither language has word spacing, ChrF is used instead of ChrF++. Unlike Table 7, these tables do not include the Opus-MT’s performance, as there is no checkpoint for translating Korean to Chinese and Japanese. A similar phenomenon to that observed for English is seen for Chinese and Japanese, too. The proposed pivot-based translation consistently outperforms direct translation in these languages.
Exaone is a general-purpose LLM, while m2m_100_1.2B is a specialized neural machine translation model. It is important to note that both models outperform direct translation when the proposed pivot-based approach is used. This tendency is observed in all three foreign languages. This demonstrates the robustness and effectiveness of the proposed pivot-based translation approach with character-level tokenization for dialect normalization.
6. Conclusions
In this paper, we propose a pivot-based translation model for translating Korean dialects into foreign languages. To overcome the lack of a parallel corpus for direct translation from a dialect to a foreign language, our model first normalizes a dialect sentence into a standard sentence and then translates the standard sentence into a foreign language. The dialect normalization model is a GRU-based sequence-to-sequence model with minGRU as an encoder and GRU as a decoder. As the model adopts alphabet-level tokenization, the input sentence tends to be a long sequence of tokens; to address this issue, a multi-layer minGRU is adopted as the encoder instead of a GRU. A legacy translator is then used for the translation between standard and foreign sentences. In this paper, two neural translators (Opus-MT and m2m_100_1.2B) and an LLM, Exaone, have been tested for this purpose.
Experiments on the Korean Dialect Speech Dataset demonstrate that alphabet-level tokenization achieves higher performance than sub-word tokenization and morpheme-level tokenization. Furthermore, minGRU is shown to be a more effective encoder than GRU for dialect normalization. In addition, it was also shown that the proposed pivot-based translation is superior to the direct translation when translating Korean dialects to English, Chinese, and Japanese.
Conceptualization, J.P.; methodology, S.-B.P.; software, J.P.; validation, S.-B.P.; formal analysis, J.P.; investigation, J.P.; resources, S.-B.P.; data curation, J.P.; writing—original draft preparation, J.P.; writing—review and editing, S.-B.P.; visualization, J.P.; supervision, S.-B.P.; funding acquisition, S.-B.P. All authors have read and agreed to the published version of the manuscript.
The data used in this paper are published on the Korea AI Hub.
The authors declare no conflicts of interest.
Figure 1 An example translation of a Gangwon dialect sentence to English. The input Korean sentence is “이 나이에도 여전히 걱정만 끼쳐 죄송해요.” Its English translation is “I’m sorry I still worry you even at this age.” but the dialect sentence is mistranslated to “I’m sorry that I still cause extreme pain even in this situation.”
Figure 2 The overall structure of the proposed pivot-based machine translator for Korean dialects. The red font indicates the normalized part of the given sentence.
Figure 3 The architecture of the proposed translator from Korean regional dialects to the standard.
Figure 4 Examples of prompts for Llama-3.1-8B and Exaone-3.0-7.8B.
Figure 5 Epoch time comparison between bi-directional minGRU and bi-directional GRU.
Figure 6 Change in GPU execution time per sentence according to token length in dialect normalization.
Example specialized words in Korean dialects.
| Dialect | Standard Form | Dialect Form | Meaning |
|---|---|---|---|
| Gyeongsang | 하다 [hada] | 카다 [kada] | do |
| Jeolla | 버르장머리 [pʌrɯdzaŋmʌri] | 버르쟁이 [pʌrɯdzɛŋi] | courtesy |
| Jeju | 있었어? [iśʌśʌ] | 있언? [iśʌn] | was it? |
| Gangwon | 고등학교 [kodɯŋhakkyo] | 고등핵교 [kodɯŋhɛkkyo] | high school |
| Chungcheong | 어떻게 [ʌt́ʌkhe] | 어트케 [ʌdhɯkhe] | how |
An example of tokenization for a Korean dialect sentence. ‘##’ in morpheme-level tokens and ‘▁’ in SentencePiece tokens represent white space information.
| Tokenization Method | Tokens |
|---|---|
| dialect sentence | 목사님 그 앞에 모니터 좀 있으면 좋지 않안허쿠가 영 하니까 |
| standard sentence | 목사님 그 앞에 모니터 좀 있으면 좋지 않겠어요 이렇게 하니까 |
| SentencePiece | ▁목사님 ▁그 ▁앞에 ▁모니터 ▁좀 ▁있으면 ▁좋지 ▁않안 허 쿠가 ▁영 ▁하니까 |
| byte-level BPE | 목 ìĤ¬ëĭĺ Ġê·¸ ĠìķŀìĹIJ Ġ모ëĭĪ íĦ° Ġì¢Ģ ĠìŀĪìľ¼ë©´ Ġì¢ĭì§Ģ ĠìķĬìķĪ íĹĪ ì¿łê°Ģ Ġìĺģ ĠíķĺëĭĪê¹Į |
| morpheme-level | 목사 ##님 그 앞 ##에 모니터 좀 있 ##으면 좋 ##지 않 ##안 ##허 ##쿠 ##가 영 하 ##니까 |
| alphabet-level | ㅁ ㅗ ㄱ … ㅇ ㅏ ㅍ … ㅁ ㅕ ㄴ … ㅇ ㅕ ㅇ … |
Statistics of the dataset for Korean dialect normalization; the numbers in parentheses are the numbers of pairs before pre-processing.
| Dialect | # of Training Pairs | # of Validation Pairs | # of Test Pairs |
|---|---|---|---|
| Gyeongsang | 260,494 (2,088,717) | 14,210 (89,512) | 14,181 (89,511) |
| Jeolla | 254,207 (1,992,101) | 25,922 (110,458) | 25,727 (110,459) |
| Jeju | 758,384 (2,774,257) | 42,938 (80,062) | 42,669 (80,061) |
| Gangwon | 557,969 (1,573,237) | 24,203 (91,346) | 24,084 (91,345) |
| Chungcheong | 260,494 (1,848,455) | 15,434 (95,000) | 15,601 (95,000) |
Korean dialect normalization results evaluated on BLEU. * Means that the difference is statistically significant (p < 0.05) compared to the alphabet-level tokenization. The bold values indicate the best performance in each column.
| Tokenizations | Gyeongsang | Jeolla | Jeju | Gangwon | Chungcheong | Overall |
|---|---|---|---|---|---|---|
| SentencePiece | | | | | | 53.82 |
| byte-level BPE | | | | | | 40.84 |
| morpheme-level | | | | | | 85.38 |
| alphabet-level | | | | | | 94.88 |
Korean dialect normalization results evaluated on chrF++. * Implies statistical significance (p < 0.05) over the alphabet-level tokenization. The bold values indicate the best performance in each column.
| Tokenizations | Gyeongsang | Jeolla | Jeju | Gangwon | Chungcheong | Overall |
|---|---|---|---|---|---|---|
| SentencePiece | | | | | | 43.64 |
| byte-level BPE | | | | | | 31.3 |
| morpheme-level | | | | | | 78.42 |
| alphabet-level | | | | | | 91.56 |
Performance comparison according to encoder types: GRU vs. minGRU.
| Dialect | BLEU (minGRU) | BLEU (GRU) | chrF++ (minGRU) | chrF++ (GRU) |
|---|---|---|---|---|
| Gyeongsang | | | | |
| Jeolla | | | | |
| Jeju | | | | |
| Gangwon | | | | |
| Chungcheong | | | | |
| Overall | 94.88 | 95.44 | 91.56 | 92.50 |
Evaluations on Korean dialect translation to English. In this table, chr./B. means chrF++/BERT score.
| Dialect | Direct Translation, Opus-MT (chr./B.) | Direct Translation, m2m_100_1.2B (chr./B.) | Direct Translation, Exaone (chr./B.) | Proposed Model, Opus-MT (chr./B.) | Proposed Model, m2m_100_1.2B (chr./B.) | Proposed Model, Exaone (chr./B.) |
|---|---|---|---|---|---|---|
| Gyeongsang | 27.82/89.41 | 27.64/89.62 | 39.80/91.23 | 29.19/89.65 | 29.17/89.90 | 40.45/91.26 |
| Jeolla | 27.86/89.21 | 27.66/89.37 | 39.81/90.95 | 28.96/89.48 | 29.27/89.75 | 40.37/91.05 |
| Jeju | 23.30/88.15 | 22.97/88.18 | 33.86/89.85 | 26.06/88.90 | 25.94/89.20 | 36.75/90.51 |
| Gangwon | 25.95/88.92 | 25.61/89.15 | 37.18/90.56 | 27.83/89.45 | 27.43/89.62 | 38.36/90.77 |
| Chungcheong | 28.06/89.22 | 28.12/89.40 | 39.75/90.93 | 29.10/89.45 | 29.54/89.70 | 40.22/90.98 |
| Overall | 26.60/88.98 | 26.40/89.16 | 38.08/90.70 | 28.23/89.39 | 28.27/89.63 | 39.05/90.91 |
Evaluations on translation from Korean dialects to Chinese. In this table, BERT means BERT score.
| Dialect | Direct Translation, m2m_100_1.2B (chrF/BERT) | Direct Translation, Exaone (chrF/BERT) | Proposed Model, m2m_100_1.2B (chrF/BERT) | Proposed Model, Exaone (chrF/BERT) |
|---|---|---|---|---|
| Gyeongsang | 9.03/69.28 | 13.16/73.92 | 9.69/70.37 | 13.31/74.05 |
| Jeolla | 9.01/69.35 | 12.71/73.65 | 9.67/70.55 | 12.94/73.90 |
| Jeju | 7.62/67.19 | 10.82/71.02 | 9.02/69.92 | 12.09/73.10 |
| Gangwon | 8.18/68.47 | 12.01/72.86 | 9.12/70.16 | 12.50/73.58 |
| Chungcheong | 8.74/69.44 | 12.33/73.66 | 9.37/70.40 | 12.60/73.90 |
| Overall | 8.52/68.75 | 12.21/73.02 | 9.37/70.28 | 12.69/73.71 |
Evaluations on translation from Korean dialects to Japanese. In this table, BERT means BERT score.
| Dialect | Direct Translation, m2m_100_1.2B (chrF/BERT) | Direct Translation, Exaone (chrF/BERT) | Proposed Model, m2m_100_1.2B (chrF/BERT) | Proposed Model, Exaone (chrF/BERT) |
|---|---|---|---|---|
| Gyeongsang | 12.94/74.18 | 19.69/78.59 | 13.83/75.01 | 20.67/79.07 |
| Jeolla | 13.03/74.18 | 19.44/78.48 | 13.78/75.00 | 19.92/78.81 |
| Jeju | 9.38/71.71 | 15.42/76.20 | 11.02/73.35 | 17.05/77.47 |
| Gangwon | 10.99/73.12 | 17.92/77.79 | 12.09/74.26 | 18.60/78.31 |
| Chungcheong | 12.71/73.90 | 19.08/78.21 | 13.51/74.60 | 19.70/78.51 |
| Overall | 11.81/73.42 | 18.31/77.85 | 12.84/74.44 | 19.19/78.39 |
1. Choi, M.O. Korean Dialects; Sechang: Seoul, Republic of Korea, 1995.
2. Li, J.; Shen, Y.; Huang, S.; Dai, X.; Chen, J. When is char better than subword: A systematic study of segmentation algorithms for neural machine translation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Online, 1–6 August 2021; pp. 543-549.
3. Chen, Y.; Liu, Y.; Cheng, Y.; Li, V. A Teacher-Student Framework for Zero-Resource Neural Machine Translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics; Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1925-1935.
4. Firat, O.; Sankaran, B.; Al-Onaizan, Y.; Vural, F.; Cho, K.H. Zero-Resource Translation with Multi-lingual Neural Machine Translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing; Austin, TX, USA, 1–5 November 2016; pp. 268-277.
5. Kim, Y.S.; Petrov, P.; Petrushkov, P.; Ney, H. Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing; Hong Kong, China, 3–7 November 2019.
6. Kuparinen, O.; Miletić, A.; Scherrer, Y. Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023; Singapore, 6–10 December 2023; pp. 13814-13828.
7. Tan, S.; Joty, S.; Varshney, L.; Kan, M.Y. Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing; Online, 16–20 November 2020; pp. 5647-5663. [DOI: https://dx.doi.org/10.18653/v1/2020.emnlp-main.455]
8. Abe, K.; Matsubayashi, Y.; Okazaki, N.; Inui, K. Multi-dialect Neural Machine Translation and Dialectometry. Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation; Hong Kong, China, 1–3 December 2018.
9. Johnson, M.; Schuster, M.; Le, Q.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; et al. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans. Assoc. Comput. Linguist.; 2017; 5, pp. 339-351.
10. Honnet, P.E.; Popescu-Belis, A.; Musat, C.; Baeriswyl, M. Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German. Proceedings of the 11th International Conference on Language Resources and Evaluation; Miyazaki, Japan, 7–12 May 2018.
11. Sajjad, H.; Darwish, K.; Belinkov, Y. Translating Dialectal Arabic to English. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics; Sofia, Bulgaria, 4–9 August 2013; pp. 1-6.
12. Faheem, M.; Wassif, K.; Bayomi, H.; Abdou, S. Improving neural machine translation for low resource languages through non-parallel corpora: A case study of Egyptian dialect to modern standard Arabic translation. Sci. Rep.; 2024; 14, 2265. [DOI: https://dx.doi.org/10.1038/s41598-023-51090-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38280911]
13. Liu, Z.; Ni, S.; Aw, A.; Chen, N. Singlish Message Paraphrasing: A Joint Task of Creole Translation and Text Normalization. Proceedings of the 29th International Conference on Computational Linguistics; Gyeongju, Republic of Korea, 12–17 October 2022; pp. 3924-3936.
14. Paul, M.; Finch, A.; Dixon, P.; Sumita, E. Dialect Translation: Integrating Bayesian Co-segmentation Models with Pivot-based SMT. Proceedings of the 1st Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties; Edinburgh, UK, 31 July 2011; pp. 1-9.
15. Jeblee, S.; Feely, W.; Bouamor, H.; Lavie, A.; Habash, N.; Oflazer, K. Domain and Dialect Adaptation for Machine Translation into Egyptian Arabic. Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing; Doha, Qatar, 25–29 October 2014; pp. 196-206.
16. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding Back-Translation at Scale. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Brussels, Belgium, 31 October–4 November 2018; pp. 489-500. [DOI: https://dx.doi.org/10.18653/v1/D18-1045]
17. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics; Berlin, Germany, 7–12 August 2016; pp. 86-96. [DOI: https://dx.doi.org/10.18653/v1/P16-1009]
18. Tahssin, R.; Kishk, Y.; Torki, M. Identifying Nuanced Dialect for Arabic Tweets with Deep Learning and Reverse Translation Corpus Extension System. Proceedings of the 5th Arabic Natural Language Processing Workshop; Barcelona, Spain, 12 December 2020; pp. 288-294.
19. Riley, P.; Dozat, T.; Botha, J.; Garcia, X.; Garrette, D.; Riesa, J.; Firat, O.; Constant, N. FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation. Trans. Assoc. Comput. Linguist.; 2023; 11, pp. 671-685. [DOI: https://dx.doi.org/10.1162/tacl_a_00568]
20. Sun, J.; Sellam, T.; Clark, E.; Vu, T.; Dozat, T.; Garrette, D.; Siddhant, A.; Eisenstein, J.; Gehrmann, S. Dialect-robust Evaluation of Generated Text. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics; Toronto, ON, Canada, 9–14 July 2023; pp. 6010-6028. [DOI: https://dx.doi.org/10.18653/v1/2023.acl-long.331]
21. Lim, S.B.; Park, C.J.; Yang, Y.W. Deep Learning-based Korean Machine Translation Research Considering Linguistics Features and Service. J. Korean Converg. Soc.; 2022; 13, pp. 21-29.
22. Hwang, J.S.; Yang, H.C. Korean dialect-standard language translation using special token in KoBART. Proceedings of the Symposium of the Korean Institute of Communications and Information Sciences; Pyeongchang, Republic of Korea, 31 January–2 February 2024; pp. 1178-1179.
23. Lee, S.Y.; Jung, D.-E.; Sim, J.Y.; Kim, S.H. Study on Jeju Dialect Machine Translation Utilizing an Open-Source Large Language Model. Proceedings of the Summer Annual Conference of IEIE 2024; Jeju, Republic of Korea, 26–28 June 2024; pp. 2923-2926.
24. Roh, H.G.; Lee, K.H. A Basic Performance Evaluation of the Speech Recognition APP of Standard Language and Dialect using Google, Naver, and Daum KAKAO APIs. Asia-Pac. J. Multimed. Serv. Converg. Art Humanit. Sociol.; 2017; 7, pp. 819-829. [DOI: https://dx.doi.org/10.14257/AJMAHS.2017.12.22]
25. Na, J.; Park, Y.; Lee, B. A Comparative Study on the Biases of Age, Gender, Dialects, and L2 speakers of Automatic Speech Recognition for Korean Language. Proceedings of the 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); Macau, China, 3–6 December 2024; pp. 1-6. [DOI: https://dx.doi.org/10.1109/APSIPAASC63619.2025.10848815]
26. Bak, S.H.; Choi, S.M.; Jung, Y.C. Voice Recognition Control using LLM for Regional Dialects. Proceedings of the KIIT Conference; Jeju, Republic of Korea, 14–17 October 2025; pp. 617-620.
27. An, S.Y.; Bae, K.H.; Choi, E.B.; Choi, S.; Choi, Y.M.; Hong, S.K.; Hong, Y.J.; Hwang, J.W.; Jeon, H.J.; Jo, G.; et al. EXAONE 3.0 7.8B Instruction Tuned Language Model. arXiv; 2024.
28. Cho, K.H.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing; Doha, Qatar, 25–29 October 2014; pp. 1724-1734.
29. Feng, L.; Tung, F.; Ahmed, M.; Bengio, Y.; Hajimirsadeghi, H. Were RNNs All We Needed?. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2410.01201] arXiv: 2410.01201
30. Blelloch, G. Prefix Sums and Their Applications; Technical Report CMU-CS-90-190 School of Computer Science, Carnegie Mellon University: Pittsburgh, PA, USA, 1990.
31. Heinsen, F. Efficient Parallelization of a Ubiquitous Sequential Computation. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2311.06281] arXiv: 2311.06281
32. Luong, T.; Pham, H.; Manning, C.; Su, J. Effective Approaches to Attention-based Neural Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Lisbon, Portugal, 17–21 September 2015; pp. 1412-1421. [DOI: https://dx.doi.org/10.18653/v1/D15-1166]
33. Tiedemann, J.; Thottingal, S. OPUS-MT—Building open translation services for the World. Proceedings of the 22nd Annual Conference of the European Association for Machine Translation; Lisboa, Portugal, 3–5 November 2020; pp. 479-480.
34. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V. Beyond English-Centric Multilingual Machine Translation. J. Mach. Learn. Res.; 2020; 22, pp. 4839-4886.
35. Junczys-Dowmunt, M.; Grundkiewicz, R.; Dwojak, T.; Hoang, H.; Heafield, K.; Neckermann, T.; Seide, F.; Germann, U.; Aji, A.; Bogoychev, N.; et al. Marian: Fast Neural Machine Translation in C++. Proceedings of ACL 2018, System Demonstrations; Melbourne, Australia, 15–20 July 2018; pp. 116-121.
36. Tiedemann, J. Parallel Data, Tools and Interfaces in OPUS. Proceedings of the 8th International Conference on Language Resources and Evaluation; Istanbul, Turkey, 23–25 May 2012; pp. 2214-2218.
37. Sim, Y.J.; Lee, W.J.; Kim, H.J.; Kim, H.S. Evaluating Large Language Models on Korean Cultural Understanding in Empathetic Response Generation. Proceedings of the 36th Annual Conference on Human and Cognitive Language Technology; Seongnam, Republic of Korea, 11–12 October 2024; pp. 325-330.
38. Hanna, M.; Bojar, O. A Fine-Grained Analysis of BERTScore. Proceedings of the Sixth Conference on Machine Translation; Online, 10–11 November 2021; Barrault, L.; Bojar, O.; Bougares, F.; Chatterjee, R.; Costa-jussa, M.R.; Federmann, C.; Fishel, M.; Fraser, A.; Freitag, M.; Graham, Y.; et al., Eds.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).