Abstract
Despite recent advances in machine translation quality, neologisms in short texts remain a challenge. Neologisms are newly coined words or expressions that have not yet been accepted into the mainstream language but are gaining wider usage. Their presence in a sentence lowers translation quality because systems often treat them as proper nouns, which leads to fragmented output. In this project, we develop a pipeline that automatically detects potential neologisms and finds possible replacements for each unknown word. Replacements are obtained from the fill-mask task of BERT, a widely used large language model, which uses the surrounding context to estimate the word that best fits the sentence.
The detection pipeline can separate relatively new words from the full set of unique words in the dataset, but its performance still suffers from an inability to distinguish neologisms from named entities. Replacing the detected neologisms using BERT's fill-mask task yielded a BLEU score of 0.249, compared with 0.202 for the raw translation. Translation after replacement also performs better qualitatively: raw-translated sentences are highly fragmented, whereas sentences translated after replacement are more structured. However, even with a large language model, the system cannot recover the exact meaning of a neologism, because the language model's training data consist of formal text such as newspaper articles. This thesis emphasizes the need for further research on identifying and processing neologisms in text, and for large language models trained on short informal texts, which are becoming increasingly dominant.
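The detect-then-replace idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`detect_candidates`, `replace_neologisms`, `toy_fill_mask`), the vocabulary-lookup detection rule, and the whitespace tokenization are all assumptions made for illustration. The masked-word predictor is passed in as a callable so that the example runs without downloading a model; a real predictor could, for instance, wrap the Hugging Face `transformers` fill-mask pipeline, although the abstract does not say which library or BERT checkpoint was used.

```python
import string
from typing import Callable, List, Set

MASK = "[MASK]"  # mask token used by BERT-style models


def detect_candidates(tokens: List[str], known_vocab: Set[str]) -> List[int]:
    """Return indices of tokens that are absent from a known vocabulary.

    Assumption: "unknown to the vocabulary" stands in for "potential
    neologism". Like the pipeline in the abstract, this rule cannot tell
    neologisms apart from named entities.
    """
    hits = []
    for i, tok in enumerate(tokens):
        word = tok.strip(string.punctuation).lower()
        if word and word not in known_vocab:
            hits.append(i)
    return hits


def replace_neologisms(
    sentence: str,
    known_vocab: Set[str],
    fill_mask: Callable[[str], str],
) -> str:
    """Mask each candidate in turn and substitute the predictor's word.

    The whole token is replaced, so punctuation attached to it is dropped.
    """
    tokens = sentence.split()
    for i in detect_candidates(tokens, known_vocab):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        tokens[i] = fill_mask(" ".join(masked))
    return " ".join(tokens)


def toy_fill_mask(masked_sentence: str) -> str:
    """Hypothetical stand-in for a masked language model.

    A real predictor might be built with, for example:
        from transformers import pipeline
        unmasker = pipeline("fill-mask", model="bert-base-uncased")
        best = unmasker(masked_sentence)[0]["token_str"]
    The checkpoint name here is an assumption, not taken from the thesis.
    """
    return "good"


if __name__ == "__main__":
    vocab = {"that", "movie", "was"}
    print(replace_neologisms("that movie was bussin", vocab, toy_fill_mask))
```

Passing the predictor as a parameter keeps the detection logic testable on its own. The replaced sentence would then be sent to the translation system in place of the raw one.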