© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Chinese tokenization remains a central challenge in natural language processing (NLP) because written Chinese lacks explicit word boundaries. Existing tokenization methods often treat each Chinese character as an indivisible unit and neglect the finer semantic features embedded in characters, such as radicals. To address this, we propose a novel token representation method that integrates radical-based features into tokenization. The method extends the vocabulary to include both radicals and the original character tokens, enabling a more granular understanding of Chinese text. We conduct experiments on seven datasets covering multiple Chinese NLP tasks. The results show that our method significantly improves model performance on downstream tasks. Specifically, the accuracy of BERT on the BQ Corpus dataset rose to 86.95%, an improvement of 1.65% over the baseline, and BERT-wwm improved by 1.28%. These results suggest that incorporating fine-grained radical features offers a more effective approach to Chinese tokenization and paves the way for future research in Chinese text processing.
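
The idea of a vocabulary that holds both character tokens and radical tokens can be illustrated with a toy sketch. This is not the authors' implementation: the four-entry radical table, the `[R]` prefix used to keep radical tokens distinct from character tokens, and the function names `build_vocab` and `encode` are all assumptions made for illustration.

```python
# Toy character-to-radical table (assumption: a real system would load a full dictionary).
CHAR_TO_RADICAL = {
    "河": "氵",
    "湖": "氵",
    "妈": "女",
    "好": "女",
}

def build_vocab(chars, char_to_radical):
    """Return a token -> id map: character tokens first, then radical tokens."""
    vocab = {"[UNK]": 0}
    for ch in chars:
        vocab.setdefault(ch, len(vocab))
    for ch in chars:
        radical = char_to_radical.get(ch)
        if radical is not None:
            # Hypothetical "[R]" prefix: some radicals (e.g. 女) are also characters.
            vocab.setdefault("[R]" + radical, len(vocab))
    return vocab

def encode(text, vocab, char_to_radical):
    """Map each character to a (char_id, radical_id) pair, falling back to [UNK]."""
    unk = vocab["[UNK]"]
    pairs = []
    for ch in text:
        char_id = vocab.get(ch, unk)
        radical = char_to_radical.get(ch)
        radical_id = vocab.get("[R]" + radical, unk) if radical is not None else unk
        pairs.append((char_id, radical_id))
    return pairs

vocab = build_vocab("河湖妈好", CHAR_TO_RADICAL)
ids = encode("河好", vocab, CHAR_TO_RADICAL)
```

In this sketch, 河 and 湖 share the radical token for 氵, so a model consuming both ids per position could in principle exploit that shared feature. How the radical and character representations are actually combined is not specified in the abstract.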

Details

Title
A Radical-Based Token Representation Method for Enhancing Chinese Pre-Trained Language Models
Author
Qin, Honglun 1; Li, Meiwen 2; Wang, Lin 1; Ge, Youming 1; Zhu, Junlong 1; Zheng, Ruijuan 2

1 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China; [email protected] (H.Q.); [email protected] (L.W.); [email protected] (Y.G.); [email protected] (J.Z.)
2 School of Software, Henan University of Science and Technology, Luoyang 471023, China; [email protected]
First page
1031
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3176380797