
Abstract

Chinese tokenization remains a primary challenge in natural language processing (NLP) because written Chinese lacks explicit word boundaries. Existing tokenization methods often treat each Chinese character as an indivisible unit, neglecting the finer semantic features embedded in characters, such as radicals. To address this issue, we propose a novel token representation method that integrates radical-based features into tokenization. The method extends the vocabulary to include both radicals and the original character tokens, enabling a more granular understanding of Chinese text. We conduct experiments on seven datasets covering multiple Chinese NLP tasks. The results show that our method significantly improves model performance on downstream tasks. Specifically, the accuracy of BERT on the BQ Corpus dataset rose to 86.95%, an improvement of 1.65% over the baseline, and the performance of BERT-wwm improved by 1.28%. These results suggest that incorporating fine-grained radical features offers a more effective solution for Chinese tokenization and paves the way for future research in Chinese text processing.
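The vocabulary extension described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `RADICALS` table, the `[RAD_x]` token format, the helper names `build_vocab` and `tokenize`, and the choice to place each radical token directly after its character are all assumptions made for illustration only.

```python
# Hypothetical character-to-radical table (illustration only); a real system
# would load a complete radical dictionary instead of these four entries.
RADICALS = {"河": "氵", "湖": "氵", "树": "木", "林": "木"}


def build_vocab(chars, radicals):
    """Build a vocabulary holding the original characters plus radical tokens."""
    vocab = {}
    for ch in chars:
        vocab.setdefault(ch, len(vocab))
    # Append one token per distinct radical after the character tokens.
    for rad in sorted(set(radicals.values())):
        vocab.setdefault(f"[RAD_{rad}]", len(vocab))
    return vocab


def tokenize(text, radicals):
    """Emit each character, followed by its radical token when one is known.

    Interleaving is an assumed design choice; the abstract does not say how
    radical and character tokens are combined.
    """
    tokens = []
    for ch in text:
        tokens.append(ch)
        rad = radicals.get(ch)
        if rad is not None:
            tokens.append(f"[RAD_{rad}]")
    return tokens


vocab = build_vocab("河湖树林", RADICALS)
tokens = tokenize("河树", RADICALS)
ids = [vocab[t] for t in tokens]
```

In this sketch, characters that share a radical (for example 河 and 湖, both with 氵) also share a radical token, which is one way a model could be exposed to the sub-character signal the abstract refers to.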

Details

Title
A Radical-Based Token Representation Method for Enhancing Chinese Pre-Trained Language Models
Author
Qin, Honglun 1; Li, Meiwen 2; Wang, Lin 1; Ge, Youming 1; Zhu, Junlong 1; Zheng, Ruijuan 2

1 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China; [email protected] (H.Q.); [email protected] (L.W.); [email protected] (Y.G.); [email protected] (J.Z.)
2 School of Software, Henan University of Science and Technology, Luoyang 471023, China; [email protected]
Publication title
Volume
14
Issue
5
First page
1031
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
Publication subject
e-ISSN
20799292
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-03-05
Milestone dates
2025-01-23 (Received); 2025-03-04 (Accepted)
   First posting date
05 Mar 2025
ProQuest document ID
3176380797
Document URL
https://www.proquest.com/scholarly-journals/radical-based-token-representation-method/docview/3176380797/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-07
Database
ProQuest One Academic