Fine-Tuning Large Language Models for Kazakh Text Simplification

Abstract

This paper addresses the task of text simplification for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validate its performance on 400 examples, and compare it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore the effect of prompt language (English vs. Kazakh) and conduct a human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity, significantly above GPT-4o-mini. Error analysis shows that the remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh’s agglutinative morphology and flexible syntax.
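The abstract reports a length ratio of 0.98 alongside BLEU, SARI, and F1 but does not define it. The sketch below shows one common reading, under stated assumptions: the ratio is the total number of system-output tokens divided by the total number of reference tokens, with plain whitespace tokenization. The function name `length_ratio` and the placeholder token strings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def length_ratio(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Corpus-level ratio of output tokens to reference tokens.

    Assumption: whitespace tokenization; the paper may tokenize differently.
    """
    if len(outputs) != len(references):
        raise ValueError("outputs and references must be the same length")
    out_tokens = sum(len(o.split()) for o in outputs)
    ref_tokens = sum(len(r.split()) for r in references)
    if ref_tokens == 0:
        raise ValueError("references contain no tokens")
    return out_tokens / ref_tokens


# Placeholder token strings, not real Kazakh data.
outputs = ["a b c d", "e f g"]
references = ["a b c d e", "f g h"]
ratio = length_ratio(outputs, references)  # 7 output tokens / 8 reference tokens
```

Under this reading, a value close to 1.0 would mean the system outputs are about as long as the reference simplifications, so the model is neither truncating sentences nor padding them.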

Details

Business indexing term

Subject:

Machine learning

Identifier / keyword

fine-tuning; Kazakh language; large language models; text simplification

Title

Fine-Tuning Large Language Models for Kazakh Text Simplification

Author

Toleu, Alymzhan¹; Tolegen, Gulmira¹; Ualiyeva, Irina²

¹ Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; AI Research Laboratory, Satbayev University, Almaty 050000, Kazakhstan
² Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; Faculty of Information Technology, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan

Publication title

Applied Sciences; Basel

Volume

15

Issue

15

First page

8344

Publication year

2025

Publication date

2025

Publisher

MDPI AG

Place of publication

Basel

Country of publication

Switzerland

Publication subject

Sciences: Comprehensive Works

e-ISSN

20763417

Source type

Scholarly Journal

Language of publication

English

Document type

Journal Article

Publication history

Online publication date

2025-07-26

Milestone dates

2025-06-20 (Received); 2025-07-25 (Accepted)

First posting date

26 Jul 2025

DOI

https://doi.org/10.3390/app15158344

ProQuest document ID

3239020943

Document URL

https://www.proquest.com/scholarly-journals/fine-tuning-large-language-models-kazakh-text/docview/3239020943/se-2?accountid=208611

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Last updated

2025-11-07

Database

ProQuest One Academic
