
Abstract

This paper addresses the text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task itself, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore the effect of prompt language (English vs. Kazakh) and conduct a human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity, significantly above GPT-4o-mini. Error analysis shows that the remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh's agglutinative morphology and flexible syntax.
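
Since the abstract reports BLEU, SARI, and a length ratio on the automatic test set, a minimal sketch of how such scores are typically computed is given below. It assumes the Hugging Face `evaluate` package; the abstract does not name the evaluation toolkit, and the sentence triples here are English placeholders, not items from the KazSim corpus.

```python
# Minimal sketch of the automatic evaluation described in the abstract
# (BLEU, SARI, length ratio). The use of Hugging Face `evaluate` is an
# assumption; the placeholder sentences are illustrative, not KazSim data.
import evaluate

sari = evaluate.load("sari")        # simplification metric over keep/add/delete operations
bleu = evaluate.load("sacrebleu")   # corpus-level BLEU

sources = ["The committee reached a unanimous determination to postpone the vote."]
predictions = ["The committee agreed to delay the vote."]
references = [["The committee decided to put off the vote."]]  # one or more refs per source

bleu_score = bleu.compute(predictions=predictions, references=references)["score"]
sari_score = sari.compute(sources=sources, predictions=predictions,
                          references=references)["sari"]

# Length ratio: computed here as output tokens over source tokens; the abstract
# reports 0.98 for KazSim but does not define the ratio's exact denominator.
length_ratio = (sum(len(p.split()) for p in predictions)
                / sum(len(s.split()) for s in sources))

print(f"BLEU={bleu_score:.2f}  SARI={sari_score:.2f}  len_ratio={length_ratio:.2f}")
```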

Details

Title
Fine-Tuning Large Language Models for Kazakh Text Simplification
Author
Alymzhan, Toleu 1; Gulmira, Tolegen 1; Ualiyeva, Irina 2

1 Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; AI Research Laboratory, Satbayev University, Almaty 050000, Kazakhstan; …@satbayev.university (G.T.)
2 Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; Faculty of Information Technology, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan
Publication title
Applied Sciences
Volume
15
Issue
15
First page
8344
Number of pages
24
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
e-ISSN
2076-3417
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-07-26
Milestone dates
2025-06-20 (Received); 2025-07-25 (Accepted)
First posting date
2025-07-26
ProQuest document ID
3239020943
Document URL
https://www.proquest.com/scholarly-journals/fine-tuning-large-language-models-kazakh-text/docview/3239020943/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-07
Database
ProQuest One Academic