This paper addresses the task of text simplification for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct a human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity, significantly above GPT-4o-mini. Error analysis shows that the remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh's agglutinative morphology and flexible syntax.
1. Introduction
Text simplification is a task in natural language processing (NLP) that reduces linguistic complexity while preserving the original meaning. It has applications in education, public communication, and accessibility. In education, simplification allows students in primary and secondary schools to better understand complex texts. In public services, simplified documents improve the readability of legal and administrative content. In NLP pipelines, simplification can be used as a preprocessing step to improve downstream tasks such as machine translation, summarization, and information retrieval, especially for low-resource and morphologically rich languages like Kazakh.
Classical rule-based approaches and early statistical methods show limited performance when applied to morphologically rich languages. In contrast, recent large language models (LLMs) [1,2,3,4], pre-trained on multilingual corpora and fine-tuned for generative tasks, have demonstrated promising results on simplification tasks in other languages [5,6,7]. Despite the growing availability of language technologies, Kazakh remains underrepresented in NLP applications, particularly in the area of text simplification. Unlike English and other high-resource languages, Kazakh lacks any publicly available automatic text-simplification tools, has no parallel simplification corpus, and to date no study has systematically evaluated large language models on this task.
In this paper, we present a generative approach to Kazakh text simplification by fine-tuning multilingual large language models. To support model training, complex–simple sentence pairs are constructed in a semi-synthetic manner through a heuristic-based method. During data construction, the complexity of Kazakh words in a sentence was measured by their frequency and morphological complexity, and these scores were used to filter common words from the text. The final selection of complex Kazakh sentences was performed using a heuristic approach based on maximum token length, the maximum allowed number of conjunctions, and the ratio of common words in the text. To support the development of automatic complexity classification for Kazakh, we conducted a comparative evaluation of two approaches: a simple heuristic method and a large language model (GPT-4o-mini). We randomly sampled sentences and manually annotated them as either complex or simple with the help of a native Kazakh speaker. The heuristic classifier, based on sentence length, achieved better overall accuracy and balanced precision and recall. Based on the assembled dataset, we fine-tuned various multilingual LLMs for the Kazakh text simplification task. We evaluated both fine-tuned and instruction-following LLMs, including domain-specific models such as kazLLM and Sherkala, as well as general-purpose LLMs in a zero-shot setting. To benchmark performance, we introduce KazSim, an instruction-fine-tuned model optimized specifically for simplification.
Evaluations were conducted on two test sets: one automatically constructed from the same pipeline used during training, and another semi-manually curated benchmark designed to reflect more natural simplification patterns. We present results based on standard evaluation metrics such as BLEU, ROUGE, and SARI. Experimental results showed that zero-shot and domain-adapted models are limited in their ability to produce structurally simplified and length-controlled outputs. In contrast, KazSim achieved consistently better scores across all metrics and evaluation settings, confirming the importance of task-specific supervision and targeted data construction for simplification in low-resource languages. We further investigated the impact of instruction language by comparing English and Kazakh prompts for the task. While most models showed comparable performance across both settings, slight gains were observed when prompts were given in Kazakh, particularly for zero-shot and domain-specific models. KazSim remained stable under both prompt variants, confirming its robustness and suitability for multilingual deployment. To complement automatic evaluation, we conducted a human study assessing the fluency, meaning preservation, and simplicity of simplified Kazakh texts. We selected two sets of sentence pairs: one from our KazSim model (Llama-3.3-70B) and another from GPT-4o-mini. Three native Kazakh speakers rated each pair on a 5-point Likert scale. KazSim achieved high scores across all dimensions, especially in simplicity (4.42), indicating its effectiveness in producing natural, accurate, and easy-to-understand simplifications.
Our main contributions are as follows:
- First parallel corpus for Kazakh text simplification: we introduce the first dataset comprising 8709 complex–simple Kazakh sentence pairs, automatically constructed using a heuristic-based complexity detection pipeline (validated on 400 manually annotated sentences) and augmented with LLM-generated simplifications.
- KazSim model and baseline: we propose KazSim, a fine-tuned, instruction-following simplification model built on multilingual LLM backbones (e.g., Llama-3.3, Qwen2). KazSim serves as the first benchmarked simplification model specifically optimized for the Kazakh language.
- Extensive model evaluation: we provide a comprehensive evaluation of KazSim against standard Seq2Seq architectures, domain-specific Kazakh LLMs, and zero-shot instruction-following models, using both automatically generated and semi-manually created benchmarks. Evaluation metrics include BLEU, SARI, ROUGE, BERTScore, and human evaluations from native Kazakh speakers.
- Analysis of prompt-language impact: we systematically compare the effect of English versus Kazakh instructions on simplification performance, demonstrating the advantages of native-language prompts and providing practical insights for multilingual model deployment.
- Kazakh-specific error analysis: through detailed human evaluation, we identify and characterize the main error categories (tone change, tense change, semantic drift), reflecting linguistic features unique to Kazakh, such as agglutinative morphology and flexible syntax.
2. Related Work
Text simplification aims to rewrite the original text into a simpler form while preserving its meaning. One simple way is to replace complex words in the sentences with simpler synonymous words; this process is referred to as lexical simplification. A more advanced method is to reduce the complexity of sentence structures, a process known as syntactic simplification. Existing approaches in these two directions can be categorized into three types: (i) rule-based methods, (ii) data-driven methods, and (iii) generative approaches.
In this direction, most existing studies [5,8,9,10,11] follow a common sequence of steps: (i) identify complex words, (ii) generate a set of candidate substitutions, (iii) select the most contextually appropriate alternatives, and (iv) rank the candidate substitutions according to their simplicity.
An early rule-based lexical simplification (LS) system [12] was proposed to simplify English newspaper texts for aphasic readers by combining syntactic analysis and simplification modules. It used linguistic analysis to generate synonym lists from WordNet [8], ranked them by frequency using the Oxford Psycholinguistic Database, and selected the most common synonyms for output.
Data-driven lexical simplification approaches use large parallel datasets of complex and simple texts and employ machine learning techniques to learn text simplification rules. In this direction, Drndarević and Saggion [10] conducted an empirical analysis of lexical simplification in Spanish using a parallel corpus of original and manually simplified texts. Their study identified lexical substitution as the most frequent operation and proposed a taxonomy that includes definition insertion and the simplification of named entities and numerical expressions. The authors highlighted frequency and word length as key features for synonym selection, while emphasizing the importance of word sense disambiguation for handling polysemy.
Shardlow [11] investigates techniques for automatically identifying complex words, a critical yet often under-addressed component of lexical simplification. Using a corpus derived from Simple Wikipedia edit histories, the study compares methods that include full simplification, frequency thresholding, and a supervised classification approach based on support vector machines. The results indicate that, while machine learning slightly improves precision, it suffers from a substantial loss in recall. The work emphasizes the trade-offs involved in complex-word identification and highlights its foundational role in downstream simplification tasks.
For syntactic simplification, early approaches were also based on hand-crafted rules [13]; these framed the task as a two-step process of analysis and transformation, using handcrafted rules to split complex structures such as relative clauses into simpler sentences. Siddharthan [14] introduced a framework for text simplification using transformation rules applied to typed dependency structures. The study compared different generation strategies and highlighted the trade-offs between preserving the original sentence structure and relying on full-surface realization, emphasizing robustness to parsing errors as a key factor in simplification quality. Woodsend and Lapata [15] propose a data-driven approach to sentence simplification using quasi-synchronous grammar and integer linear programming. Their model captures structural mismatches and complex rewrite operations, selecting optimal simplifications from a space of candidate rewrites. Experimental results show that their method improves readability while preserving grammaticality and meaning, without relying on handcrafted rules.
Recent work on text simplification has employed sequence-to-sequence techniques and generative approaches based on pre-trained large language models. Zhang and Lapata [16] introduce a deep reinforcement learning framework for sentence simplification, combining an encoder–decoder architecture with a reward function that promotes fluency, simplicity, and meaning preservation. Their model, DRESS, learns simplification rewrites from monolingual corpora and outperforms prior approaches across multiple benchmarks, highlighting the effectiveness of reinforcement learning for optimizing simplification quality. Mallinson et al. [6] introduced a multilingual, zero-shot framework for sentence simplification that transfers simplification knowledge from English to typologically distinct, low-resource languages in the absence of parallel corpora. Their model employs a shared transformer encoder with task- and language-specific layers trained via multi-task learning, enabling the construction of language-agnostic sentence representations. Empirical results on German datasets show that this approach yields higher-quality simplifications than both unsupervised baselines and multi-stage pivoting methods, illustrating the potential of crosslingual supervision for simplification in under-resourced settings.
Kew et al. [7] introduced BLESS, a comprehensive benchmark designed to systematically evaluate the sentence simplification capabilities of large language models (LLMs) for English, a high-resource language. Their analysis involved many models of varying sizes and architectures, tested across multiple domains and prompted under few-shot settings. Their results indicate that many LLMs, including those not specifically trained for simplification, can match or exceed the performance of existing state-of-the-art systems, while also exhibiting broader coverage of simplification operations. Ryan et al. [17] introduced a multilingual text simplification benchmark, MULTISIM. Their work enables consistent evaluation across low-, medium-, and high-resource languages, and demonstrates that multilingual and few-shot models such as BLOOM-176b can match or exceed the performance of fine-tuned models on non-English simplification tasks.
3. Methodology
3.1. Task Formulation
In this work, we define Kazakh text simplification as a sequence-to-sequence generation task. The input is a complex sentence in Kazakh, and the goal is to generate a simplified version that is easier to read and understand but still keeps the original meaning. The model learns to map the input sequence $x$ to the simplified output sequence $y$.
Formally, the model estimates the conditional probability

$$P(y \mid x) = \prod_{t=1}^{|y|} P(y_t \mid y_{<t}, x), \quad (1)$$

where $x$ is the complex input sentence and $y$ is the simplified output. The variable $t$ denotes the position index in the output sequence. At each time step $t$, the model predicts the token $y_t$ based on the input $x$ and the previously generated tokens $y_{<t}$. The goal is to maximize the likelihood of generating the simplified sentence $y$ given the input $x$.
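In practice, this likelihood is maximized by minimizing token-level cross-entropy over the reference simplification. A minimal PyTorch sketch of the objective (function and tensor names are ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def simplification_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                        pad_id: int) -> torch.Tensor:
    """Negative log-likelihood of the simplified sentence y given x.

    `logits` has shape (batch, T, vocab), where position t is conditioned
    on the input x and the previous tokens y_<t (Equation (1));
    `target_ids` has shape (batch, T). Padding positions are ignored.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*T, vocab)
        target_ids.reshape(-1),               # flatten to (batch*T,)
        ignore_index=pad_id,                  # mask padded positions
    )
```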
3.2. Automatic Identification of Complex Kazakh Sentences
Figure 1 outlines a two-step process for selecting complex sentences from a Kazakh corpus. Several Kazakh books were pre-processed and tokenized for this process.
The first step focuses on identifying complex and common Kazakh words. In the second step, the text was split into sentences. Each sentence was evaluated using a heuristic method to decide whether it is complex. Sentences that meet the criteria were selected and saved as the final output.
To identify morphologically simple and high-frequency lexical words in Kazakh, we define a scoring mechanism that integrates corpus-driven frequency statistics with morphological complexity estimation. The underlying objective is to assign higher priority to words that are both frequently observed in naturally occurring texts and exhibit minimal morphological complexity.
Morphological complexity is estimated through QazAnalyzer [18,19], a finite-state morphological analyzer developed for the Kazakh language. The analyzer decomposes surface word forms into a stem and a sequence of morphological suffixes. We approximate a word’s structural complexity by computing the count of attached suffixes. In cases where the analyzer fails to produce a valid segmentation or classifies a token as “unknown”, a predefined penalty is assigned to reflect maximal complexity, thereby discouraging the selection of such out-of-vocabulary items.
Formally, for a given word $w$, the scoring function is defined as

$$\mathrm{score}(w) = m(w) - \lambda \log f(w), \quad (2)$$

where $m(w)$ denotes the number of suffixes extracted from the morphological analysis, and $f(w)$ is the word frequency obtained from a large-scale corpus. The parameter $\lambda$ is a tunable coefficient that adjusts the influence of frequency in the overall ranking; setting $\lambda = 1$ with the natural logarithm reproduces the scores reported in Table 2. Table 1 presents Pearson correlation coefficients between the final score, morphological complexity, and log frequency. A moderate positive correlation (r = 0.558) is observed between the final score and the morphological score, indicating that structurally complex words tend to receive higher scores. A strong negative correlation between the final score and log frequency confirms that frequently used words are consistently ranked lower. The weak negative correlation between the morphological score and log frequency suggests that morphologically rich forms occur less frequently.
Table 2 shows the top 20 most frequent and morphologically simple Kazakh words based on the lowest final scores. Most have a morph score of 1, reflecting minimal suffixation, and occur frequently in the corpus. A few structurally richer forms (morph score 2) are included due to their high usage.
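A sketch of this scoring step is given below; QazAnalyzer's actual interface may differ, so the `analyzer.suffixes` call, the unknown-word penalty value, and the common-word cutoff are illustrative assumptions rather than a published API:

```python
import math
from collections import Counter

UNKNOWN_PENALTY = 10  # assumed penalty for unanalyzable ("unknown") tokens

def word_score(word: str, freq: Counter, analyzer, lam: float = 1.0) -> float:
    """score(w) = m(w) - lam * log f(w); lower = simpler and more frequent."""
    try:
        m = len(analyzer.suffixes(word))  # number of attached suffixes
    except Exception:
        m = UNKNOWN_PENALTY               # maximal complexity for OOV items
    f = freq.get(word, 1)                 # corpus frequency (avoid log(0))
    return m - lam * math.log(f)

def common_words(vocab, freq, analyzer, k: int = 1000) -> list[str]:
    """The k lowest-scoring words, i.e., the common-word list (k is assumed)."""
    return sorted(vocab, key=lambda w: word_score(w, freq, analyzer))[:k]
```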
To extract complex sentences, we implement a rule-based filtering approach that identifies complex candidates from an input set. Algorithm 1 first tokenizes each sentence and computes three primary features: (i) Total number of tokens; (ii) The count of known coordinating or subordinating conjunctions; (iii) The proportion of common words (extracted from the first step).
Algorithm 1: Extract complex Kazakh sentences.
A sentence is considered structurally complex if it satisfies the following two constraints: (1) it either exceeds a specified token length threshold or contains a high number of conjunctions, and (2) the proportion of common words within the sentence falls below a predefined threshold. Only sentences meeting both conditions are retained. This selection strategy enables the construction of a filtered corpus consisting of lexically and syntactically complex sentences. After extracting all complex sentences, we used GPT-4o-mini to simplify these sentences.
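A compact sketch of this filter follows; the threshold values and the conjunction list are illustrative placeholders, since the exact settings are not published:

```python
# Illustrative Kazakh conjunctions in Latin transliteration; the real list
# used by Algorithm 1 is not published.
CONJUNCTIONS = {"jane", "nemese", "biraq", "sondyqtan", "sebebi", "al"}

def is_complex(sentence: str, common: set[str], max_len: int = 20,
               max_conj: int = 1, common_ratio: float = 0.5) -> bool:
    """Rule-based filter mirroring Algorithm 1 (thresholds assumed).

    Keep a sentence iff (1) it is long OR conjunction-heavy, AND
    (2) its share of common words falls below `common_ratio`.
    """
    tokens = sentence.lower().split()
    n_conj = sum(tok in CONJUNCTIONS for tok in tokens)
    ratio = sum(tok in common for tok in tokens) / max(len(tokens), 1)
    return (len(tokens) > max_len or n_conj > max_conj) and ratio < common_ratio
```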
To assess the reliability of complexity classification in Kazakh text, we conducted a comparative evaluation between a simple heuristic approach and GPT-4o-mini. We randomly selected a set of 400 Kazakh sentences covering a broad range of linguistic structures. Each sentence was manually annotated by a native speaker of Kazakh with a binary label: complex or simple. Table 3 presents the evaluation results of the two automatic sentence complexity classification approaches (the heuristic classifier and GPT-4o-mini) compared against these manual annotations. The table reports standard classification metrics: accuracy, precision, recall, and F1 score.
The heuristic approach, which uses a rule based on sentence length, achieved an accuracy of 83.46% with equal precision, recall, and F1 score (83.25%). The GPT-4o-mini model achieved a slightly lower accuracy (79.20%) but demonstrated stronger recall (86.29%), indicating its effectiveness in detecting complex sentences. However, its precision (75.22%) was lower, suggesting more false positives than the heuristic approach. The F1 score (80.38%) reflects its overall balanced performance between precision and recall.
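For reference, these metrics can be computed with scikit-learn, treating "complex" as the positive class (the label encoding here is our assumption):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(gold: list[int], pred: list[int]) -> dict:
    """Accuracy/precision/recall/F1 as in Table 3; 1 = complex, 0 = simple."""
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, average="binary", pos_label=1)
    return {"accuracy": accuracy_score(gold, pred),
            "precision": p, "recall": r, "f1": f1}
```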
3.3. Seq2Seq Model
As a baseline for Kazakh text simplification, we implement a standard sequence-to-sequence architecture based on long short-term memory (LSTM) layers. The model consists of an encoder and decoder, both incorporating word embeddings and multi-layer LSTM networks. The encoder processes the input complex sentence and encodes it into a hidden representation. It includes an embedding layer followed by an LSTM, and the final hidden and cell states are passed to the decoder. The decoder generates the simplified sentence sequentially, using its own embedding and LSTM layers, followed by a linear projection to the output vocabulary. During training, we apply teacher forcing with a fixed ratio of 0.5. At each decoding step, the model receives either the ground-truth token or its own previous output. The objective is to minimize cross-entropy loss between predicted and reference tokens. Padding positions are masked during optimization.
This baseline provides a reference point for evaluating the performance gains introduced by instruction-tuned language models, and serves as a foundation for analyzing the simplification behavior under low-resource conditions.
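A condensed PyTorch sketch of this baseline is shown below, using the hyperparameters reported in Section 4.3 for the small variant; remaining details (special tokens, batching) are assumed:

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, emb=128, hid=256, layers=2, dropout=0.3):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, layers, dropout=dropout, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        _, (h, c) = self.lstm(self.emb(src))       # final states encode x
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab, emb=128, hid=256, layers=2, dropout=0.3):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, layers, dropout=dropout, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tok, h, c):                  # tok: (batch, 1)
        o, (h, c) = self.lstm(self.emb(tok), (h, c))
        return self.out(o.squeeze(1)), h, c        # logits: (batch, vocab)

def seq2seq_step(enc, dec, src, tgt, tf_ratio=0.5):
    """One forward pass with teacher forcing at the paper's ratio of 0.5."""
    h, c = enc(src)
    tok, logits = tgt[:, :1], []                   # start from BOS token
    for t in range(1, tgt.size(1)):
        logit, h, c = dec(tok, h, c)
        logits.append(logit)
        use_gold = random.random() < tf_ratio      # gold token vs. own output
        tok = tgt[:, t:t+1] if use_gold else logit.argmax(-1, keepdim=True)
    return torch.stack(logits, dim=1)              # (batch, tgt_len-1, vocab)
```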
3.4. Fine-Tuning Kazakh Text Simplification
To improve over the baseline Seq2Seq model, we fine-tune large pre-trained language models (LLMs) for the Kazakh text simplification task. While the Seq2Seq model is effective for learning from aligned complex–simple sentence pairs, it often struggles with long-distance dependencies, rare morphological patterns, and fluency. Large language models, pre-trained on massive multilingual corpora, are better suited for such challenges. Their ability to generalize from limited fine-tuning data makes them a promising solution for low-resource languages like Kazakh.
We use three instruction-tuned LLMs in the experiments: Llama-3.2-3B, Llama-3.3-70B, and Qwen2-72B-Instruct. These models are selected due to their open access, multilingual support, and compatibility with parameter-efficient tuning. In this work, each model is evaluated in two modes: (1) zero-shot or instruction-only inference without additional fine-tuning, and (2) fine-tuning on our Kazakh simplification dataset. This allows us to analyze the effect of instruction tuning alone versus task-specific adaptation on simplification quality.
Due to the computational cost of full fine-tuning, we apply low-rank adaptation (LoRA) [20] to all models. LoRA introduces small trainable matrices into the attention and feed-forward layers of the transformer without modifying the original weights. This allows for the efficient training of large models on limited hardware. For convenience, we denote the fine-tuned LLMs for Kazakh text simplification as KazSim.
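With the Hugging Face peft library, the setup described here might look as follows; the rank and dropout match Section 4.3, while the alpha value, target-module list, and model ID are our assumptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.1,  # rank-32 adapters (Section 4.3)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention
                    "gate_proj", "up_proj", "down_proj"],     # feed-forward
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)       # original weights stay frozen
model.print_trainable_parameters()       # only adapter matrices are trainable
```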
4. Experiments
4.1. Dataset
We selected 8709 sentence pairs for training and 500 for testing. Table 4 reports token- and character-level statistics for complex and simple sentences across both splits. As expected, complex sentences are consistently longer than their simplified counterparts in terms of both token count and character length. Vocabulary size is higher in the training set due to scale, while the type–token ratio is elevated in the test set, reflecting reduced repetition in smaller samples. These statistics provide a general overview of length and lexical variation prior to modeling. In addition to the test set described above, we include a semi-manually created test set of 163 complex–simple sentence pairs for evaluation purposes.
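The statistics in Table 4 can be reproduced with a simple helper; whitespace tokenization is assumed here, as the tokenizer used for these counts is not specified:

```python
import statistics

def split_stats(sentences: list[str]) -> dict:
    """Token- and character-level statistics as reported in Table 4."""
    token_lists = [s.split() for s in sentences]   # whitespace tokenization
    lengths = [len(t) for t in token_lists]
    all_tokens = [w for t in token_lists for w in t]
    vocab = set(all_tokens)
    return {
        "avg_tokens": sum(lengths) / len(lengths),
        "max_tokens": max(lengths),
        "min_tokens": min(lengths),
        "median_tokens": statistics.median(lengths),
        "avg_chars": sum(len(s) for s in sentences) / len(sentences),
        "vocab_size": len(vocab),
        "TTR": len(vocab) / len(all_tokens),       # type-token ratio
    }
```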
4.2. Baselines
We consider three categories of baselines: classical sequence-to-sequence models, domain-specific large language models trained on Kazakh data (kazLLM and Sherkala), and general-purpose LLMs evaluated in a zero-shot setting.
(1) Seq2Seq refers to a standard encoder–decoder architecture built with LSTM layers.
(2) Sherkala [21] is a domain-adapted LLM pretrained on a mix of Kazakh and multilingual corpora. It supports instruction-based prompting but has not been fine-tuned specifically for text simplification. We evaluate the Sherkala-Llama-3.1-8B model to establish a performance reference for general-purpose generation in a Kazakh-rich setting.
(3) kazLLM-Llama-3.1 models are large language models trained with high-resource coverage of Kazakh, Russian, Turkish, and English data. Similar to Sherkala, they are evaluated without any simplification-specific tuning. We test both the 8B and 70B variants to assess the role of scale in the absence of task alignment.
(4) Zero-shot LLMs refer to publicly available instruction-tuned models (e.g., Llama-3.2-3B, Llama-3.3-70B and Qwen2-72B) used without any additional fine-tuning. These models are prompted using a standard instruction template, but are not specially adapted to Kazakh or to simplification. They provide an upper-bound baseline for off-the-shelf generation and allow us to assess the gap between general-purpose LLMs and models explicitly adapted to the target language and task.
4.3. Model Setup and Training
Two variants of the Seq2Seq models were explored: a small and a large version. Both models share the same overall architecture and training settings, including two-layer LSTM encoder and decoder, a dropout rate of 0.3, a batch size of 128, and the use of the Adam optimizer with a learning rate of 1e-3. Early stopping is applied with a patience of 20 epochs to prevent overfitting. The small model uses an embedding dimension of 128 and a hidden size of 256, while the large model doubles these values with an embedding dimension of 256 and a hidden size of 512.
For all three LLM models, LoRA was applied to the transformer layers using rank-32 adapters with a dropout of 0.1. Fine-tuning is performed using the constructed training set of parallel complex and simplified Kazakh sentences. LLMs are trained with the AdamW optimizer, a learning rate of 2e-5, a batch size of 8, and for 3 epochs. All preprocessing steps, including tokenization, follow the original tokenizer associated with each model.
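In Hugging Face terms, the LLM fine-tuning configuration corresponds roughly to the following sketch; only the optimizer, learning rate, batch size, and epoch count come from the paper, and the remaining settings are illustrative defaults:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kazsim-lora",          # illustrative path
    per_device_train_batch_size=8,     # batch size of 8 (Section 4.3)
    learning_rate=2e-5,                # AdamW at 2e-5
    num_train_epochs=3,                # 3 epochs
    optim="adamw_torch",
    logging_steps=50,
    save_strategy="epoch",
)
```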
Figure 2 shows the training loss comparison between small and large Seq2Seq models. The larger model converges significantly faster and achieves a lower final training loss, indicating greater learning capacity and better optimization behavior. In contrast, the small model converges more slowly and stabilizes at a higher loss, which may indicate underfitting due to limited model expressiveness. This comparison confirms that increased model capacity contributes positively to the learnability of the simplification task.
Figure 3 presents the training loss trajectories of the KazSim model fine-tuned on three large language models: Qwen2-72B, Llama-3.2-3B, and Llama-3.3-70B. Among them, Llama-3.3-70B achieved the lowest and most stable training loss throughout, indicating better alignment with the target simplification objective. The 3B model shows higher and more volatile loss, while Qwen2-72B demonstrates intermediate behavior.
4.4. Evaluation Metrics
To evaluate simplification quality, we use the following standard automatic metrics: BLEU, ROUGE-L, SARI, and BERTScore. Each metric captures different aspects of output quality, including lexical overlap, structural alignment, and simplification-specific edits.
BLEU measures n-gram overlap between the system output and the reference. While originally developed for machine translation, it is widely used in simplification.
ROUGE-L computes the longest common subsequence between the prediction and the reference. Compared to BLEU, it is less sensitive to exact token matches and better reflects sentence-level alignment and fluency.
SARI is designed specifically for simplification. It evaluates the system output against both the reference and the original complex sentence, and scores three operations: addition, deletion, and retention.
BERTScore [22] leverages contextual embeddings from pretrained language models to compute the similarity between candidate and reference sentences. It reports precision, recall, and F1 based on semantic alignment, and is particularly useful for capturing meaning preservation even when surface-level tokens differ.
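All four metrics are available through the Hugging Face evaluate library; the following single-reference scoring sketch assumes these metric IDs and a multilingual BERTScore backbone, neither of which the paper specifies:

```python
import evaluate

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
sari = evaluate.load("sari")
bertscore = evaluate.load("bertscore")

def score_outputs(sources, predictions, references):
    """BLEU / ROUGE-L / SARI / BERTScore-F1 for single-reference data."""
    refs = [[r] for r in references]  # wrap: one reference per example
    bs = bertscore.compute(predictions=predictions, references=references,
                           model_type="bert-base-multilingual-cased")
    return {
        "bleu": bleu.compute(predictions=predictions,
                             references=refs)["score"],
        "rougeL": rouge.compute(predictions=predictions,
                                references=references)["rougeL"],
        "sari": sari.compute(sources=sources, predictions=predictions,
                             references=refs)["sari"],
        "bertscore_f1": sum(bs["f1"]) / len(bs["f1"]),
    }
```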
4.5. Results
Table 5 and Table 6 present the evaluation results for baseline sequence-to-sequence models and various LLMs on the Kazakh text simplification task.
Table 5 presents evaluation results on the Kazakh text simplification dataset. Seq2Seq baselines fail to generalize, producing negligible BLEU and ROUGE scores, indicating traditional architectures lack sufficient capacity to model simplification under low-resource constraints. Zero-shot LLMs also struggle, with all variants producing overextended outputs (length ratios between 3.17 and 5.38). While Llama-3.3-70B shows moderate gains in BLEU (5.53) and F1 (73.24), the absence of length control limits overall performance.
Domain-specific models, including kazLLM and Sherkala, demonstrate stable performance across all metrics and clearly outperform both Seq2Seq baselines and zero-shot LLMs. BLEU scores range from 19.59 to 21.52, and all three configurations achieve F1 scores above 83, indicating strong surface-level fluency and content preservation. In contrast to zero-shot outputs, length ratios remain close to 1.0, confirming better control over generation length.
Among these models, kazLLM-70B achieves the highest BLEU score (21.52) and the highest recall (86.65), with a length ratio of 1.06. Sherkala-8B, while slightly behind in BLEU (19.59), achieves the highest precision (82.58) and a longer average output (length ratio = 1.17). kazLLM-8B falls between the two, with balanced precision and recall (82.52/84.51) and a BLEU score of 20.72, showing that smaller-scale models can still generalize well when exposed to sufficient domain-specific data.
KazSim models outperform all baselines. KazSim (Llama-3.3-70B) achieves the best overall results, with a BLEU of 33.5, an F1 of 87.56, and a near-optimal length ratio of 0.98. Other KazSim variants (Qwen2-72B and Llama-3.2-3B) also perform well, confirming that targeted fine-tuning on task-specific Kazakh data is critical for achieving high-quality simplification.
Table 6 reports results on the semi-manually created test set for Kazakh text simplification. Compared to the automatically generated test set, performance patterns remain consistent, but scores are generally lower across all metrics. This drop suggests that the manually curated references are more diverse and structurally dissimilar from the model outputs, increasing the difficulty of achieving high lexical overlap.
Zero-shot models again show weak performance, with BLEU scores below 5 and F1 scores ranging from 58.42 to 72.08. Length ratios remain substantially inflated (3.06–5.60), indicating persistent overgeneration. Despite minor gains in F1, these models continue to underperform across all metrics, confirming the limitations of zero-shot simplification in low-resource settings.
Domain-specific models, kazLLM and Sherkala, maintain relatively good performance. BLEU scores fall between 16.35 and 17.09, and F1 remains above 82 for all configurations. Length ratios range from 1.04 to 1.25, consistent with more controlled generation behavior.
KazSim again outperforms all other approaches. KazSim (Llama-3.3-70B) achieves the highest BLEU (20.33) and F1 (84.27), with a near-optimal length ratio of 0.99. Other KazSim variants (Llama-3.2-3B and Qwen2-72B) also perform strongly, confirming that the benefits of fine-tuning extend to harder test cases with more diverse simplification references.
In comparison to the automatic test split, all models show slightly reduced BLEU and ROUGE scores on the manual set. This suggests that the manual references contain more lexical and syntactic variation, reducing surface-level overlap. However, models like KazSim that are trained on task-aligned supervision still generalize well, with minimal drop in F1 and consistent length control. Overall, the results reinforce the robustness of KazSim across evaluation settings and confirm that simplification in low-resource languages requires explicit adaptation not only to the language but also to the task.
Figure 4 presents SARI scores for all models evaluated on both the automatic and semi-manual test sets. SARI is used as the primary metric for measuring the simplification quality, as it captures the balance between content preservation, the deletion of unnecessary information, and the appropriate addition of simplified expressions.
Seq2Seq baselines perform the worst, with SARI scores of 33.56 and 33.60. These results reflect the inability of standard encoder–decoder models to generalize under low-resource scenarios. Zero-shot models demonstrate slight performance improvements, with scores ranging from 33.92 (Llama-3.2-3B) to 40.02 (Llama-3.3-70B). Notably, performance on the semi-manual test set remains stable or slightly improves across all zero-shot models. This suggests that zero-shot models are not particularly sensitive to test set construction and may rely on generic rewriting patterns that generalize equally across both test types.
Domain-specific models, including kazLLM and Sherkala, achieve higher SARI scores in the 42.86–45.80 range, and show relatively small gaps between the two test sets. This indicates better stability and improved simplification quality when models are pretrained on Kazakh data.
KazSim models achieve the highest scores overall. KazSim (Llama-3.3-70B) reaches 56.38 on the automatic test set and 48.42 on the semi-manual set. Other KazSim variants (Qwen2 and Llama-3.2-3B) also outperform all baselines and zero-shot models. Although there is a drop in SARI when moving to the manual test set, KazSim maintains a clear advantage, showing that task-specific supervision enables better generalization.
Table 7 and Table 8 summarize the confidence and variability of model performance on both automatically generated and semi-manually annotated test sets. Each table reports 95% confidence intervals and standard deviations for BLEU and SARI scores, allowing a comparison of model stability and reliability. Fine-tuned models, particularly KazSim (Llama-3.3-70B), not only achieve higher average scores, but also exhibit narrower confidence intervals and lower standard deviations, indicating more consistent performance.
To assess prompt sensitivity, we compare model outputs when using either English or Kazakh instruction prompts for the same simplification task. Results are summarized in Table 9. Overall, model performance remains relatively stable across prompt languages, though minor variations are observed. For domain-specific models (e.g., kazLLM and Sherkala), Kazakh prompts yield slightly higher SARI scores, indicating better alignment with simplification objectives when instructions are provided in the target language. Sherkala, for instance, improves from 44.23 to 46.15 in SARI with a Kazakh prompt. Zero-shot performance is more sensitive to prompt language. Llama-3.3-70B sees a notable increase in BLEU (from 4.53 to 8.26) and F1 (from 72.08 to 76.42) under Kazakh instructions, suggesting that instruction-following behavior improves when prompts are better aligned with the target output language. The proposed KazSim model remains robust under both conditions, achieving the best results across all metrics. Performance is slightly higher with Kazakh prompts, reaching a SARI of 48.78 and an F1 of 84.39, confirming the benefit of aligning the prompt language with the generation task.
Figure 5 demonstrates that prompt language has a significant impact on the quality of text simplification. Kazakh prompts consistently yield outputs that are more faithful to the original in terms of meaning, tone, and cultural context. For example, subtle emotional cues and complex sentence structures are better preserved when the model is guided by a Kazakh prompt (e.g., Sentences 1, 3, 5), whereas English prompts often lead to oversimplification or semantic drift. For example, in Sentence 1, the English-prompt version introduces the notion of a “beautiful image” (korikti suretpen) that does not exist in the original, altering the emotional nuance. In Sentence 4, metaphorical language such as “thorn” (shengelden) is inserted, which confuses the meaning and diverges from the original description of a crowd surrounding houses. These findings highlight that, for morphologically rich, low-resource languages like Kazakh, using native-language prompts during instruction tuning can lead to more accurate and nuanced simplifications.
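A minimal inference sketch for probing prompt-language sensitivity is shown below; the model ID, prompt wording, and decoding settings are illustrative rather than the paper's exact setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative backbone
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def simplify(sentence: str, instruction: str) -> str:
    """Generate a simplification under a given instruction.

    Passing an English or a Kazakh `instruction` for the same sentence
    mirrors, in spirit, the comparison reported in Table 9.
    """
    messages = [{"role": "user", "content": f"{instruction}\n\n{sentence}"}]
    ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                  return_tensors="pt")
    out = model.generate(ids, max_new_tokens=128, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```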
5. Human Evaluation
To complement automatic metrics, we conducted a human evaluation study to assess the fluency, meaning preservation, and simplicity of simplified Kazakh texts. We selected two sets of 30 sentence pairs: one generated by our fine-tuned KazSim model (LLaMA-3.3-70B) and another from the test set outputs of GPT-4o-mini. Each sentence pair includes the original complex sentence and its simplified counterpart.
Three native Kazakh speakers participated in the evaluation. Among them, two are trained computational linguists familiar with natural language processing techniques, while the third is a native Kazakh speaker without formal training in linguistics, included to represent the perspective of general users. Each evaluator rated the sentence pairs independently on a 5-point Likert scale across three dimensions:
- Fluency: how natural and grammatically correct the simplified sentence sounds.
- Meaning preservation: how accurately the simplified sentence retains the meaning of the original.
- Simplicity: how much simpler and easier to understand the output is compared to the original.
Table 10 shows the average scores given to the KazSim outputs. The model achieved high marks in all three criteria, particularly in simplicity, with an average score of 4.42, followed by meaning preservation (4.09) and fluency (4.08). These results suggest that KazSim is capable of generating simplified sentences that are not only easier to understand but also preserve the original meaning and are grammatically correct.
For comparison, Table 11 presents the evaluation of GPT-4o-mini outputs. While the outputs were generally fluent and preserved meaning to a reasonable degree, they scored lower across all criteria, especially in simplicity. Interestingly, although KazSim was trained on data generated by GPT-4o-mini, its outputs were consistently rated higher across all dimensions. This suggests that fine-tuning with aligned sentence pairs and supervised learning significantly improves quality beyond what the original generator can achieve. It demonstrates the value of model alignment and supervised fine-tuning, even when the training data comes from the same underlying source, and indicates that Llama-3.3-70B, once fine-tuned, adapts well to Kazakh stylistic and syntactic nuances.
Figure 6 lists the seven Kazakh sentence pairs with the highest average quality scores, showing each original and simplified version (with English translations).
6. Error Analysis
The human evaluation shows that most simplification errors fall into three core categories; Figure 7 illustrates each category with a representative example.
- Tone change: a major shift in stylistic or emotional quality (e.g., poetic to neutral, ironic to neutral, emotional to neutral, descriptive to neutral) that substantially alters the sentence's tone.
- Tense change: a mis-rendering of the temporal reference (past, present, or future) of an event, e.g., recasting a forthcoming or ongoing action as already completed, thereby misleading the reader about when something happens.
- Semantic drift: small lexical or structural modifications (a word swap, a dropped qualifier, a wrong suffix, etc.) that inadvertently change what the sentence means, introducing errors, nonsense, or subtle shifts in who is doing what or why.
Figure 8 illustrates the relative frequency of the three error types across 163 sentence-pair comparisons, based on a manual check of the KazSim (Llama-3.3-70B) outputs. Semantic drift emerges as the most common discrepancy, appearing in 35.6% of the pairs, while alterations in tone account for 30.7% of the mismatches. In contrast, shifts in tense are comparatively rare, affecting just 5.5% of the examples. Altogether, these three error categories comprise roughly 71.8% of all pairs, leaving the remaining 28.2% of sentences free from detectable tone, tense, or meaning changes.
7. Discussion
The human evaluation and error analysis reveal that automated Kazakh text simplification still falls short in three high-impact areas, all of which are deeply intertwined with the language's unique characteristics and the sociocultural environment in which it operates.
7.1. Interplay of Errors and Kazakh Morphosyntax
Kazakh's agglutinative morphology and flexible clause structure both contribute to and exacerbate these errors:
- A single misplaced suffix can entirely invert a word's meaning (semantic drift), as in ID 3 (Figure 7), where "led astray" became "reduced".
- Participial constructions signal imminent or ongoing events; misreading their scope leads to tense change, recasting future or ongoing revolts as past occurrences (ID 2, Figure 7).
- Rhetorical devices and topicalization patterns (common in formal or literary Kazakh) are flattened into neutral word order, triggering tone change and erasing irony or emphasis (ID 25, Figure 7).
7.2. The Resource Gap and Its Consequences
With no large-scale, human-aligned Kazakh complex–simple corpus available prior to this work, the proposed models are trained on generatively constructed parallel data selected by a heuristic approach. Furthermore, the GPT-4o-mini simplifications suffer from the same three error categories. This weak supervision propagates the following mistakes:
- Morphological misanalysis: wrong suffix interpretation, leading to semantic drift.
- Clause boundary inconsistencies: improper handling of participial and nominal clauses leads to tense change (misplaced temporal scope) and tone flattening (loss of rhetorical structure).
- Lack of readability-annotated benchmarks: without human-rated corpora, automatic metrics cannot be tuned to catch semantic drift or style errors, leaving quality issues hidden.
7.3. Sociolinguistic Stakes
In multilingual Kazakhstan, simplification is not purely academic but a matter of educational equity and civic inclusion:
- Educational equity: learners, especially children or adult L2 speakers, need graded texts that build vocabulary and grammar incrementally.
- Digital and civic inclusion: official communications in dense formal Kazakh (health advisories, legal notices) must retain tone and factual precision to reach all audiences.
- Language preservation: simplified Kazakh content strengthens its digital presence and supports intergenerational transmission.
7.4. Future Direction
Based on these findings, we outline the following strategic directions:
- Morphologically informed modeling: integrate a Kazakh morphological analyzer or character-level validation into the simplification pipeline to catch suffix-level drift.
- Style- and tense-aware objectives: augment training with auxiliary losses that penalize deviations in rhetorical register and enforce temporal consistency.
- Human-centered resource development: expand and refine the parallel corpus with diverse genres (news, legal texts, educational materials) annotated for readability, tone, and temporal framing, and establish a public benchmark with native-speaker ratings.
- Prompt engineering: this study employs straightforward prompts in both English and Kazakh; future work could investigate the impact of prompt complexity and explore hybrid or optimized prompt designs tailored to the task.
8. Conclusions
This work presents a comprehensive study of text simplification for Kazakh, a low-resource and morphologically rich language. To support training, we construct a parallel simplification dataset by first identifying complex sentences using a heuristic approach. We evaluated this heuristic method and GPT-4o-mini for Kazakh sentence complexity classification using 400 manually annotated examples. The heuristic achieved an F1 of 83.25% with balanced precision and recall, while GPT-4o-mini attained an F1 of 80.38%. For each selected complex sentence, a corresponding simplified version is generated using LLMs, enabling scalable data creation without full manual annotation. In addition to the automatically constructed test set, we also created a semi-manual test set of complex–simple sentence pairs for practical evaluation purposes. The proposed model, KazSim, is trained via instruction tuning on top of various Llama-3.3 and Qwen2 backbones and evaluated alongside a diverse set of baselines, including classical Seq2Seq models, Kazakh domain-specific LLMs, and zero-shot instruction-following models.
We first evaluate all models on the automatically constructed test set, derived from the same pipeline used for training data generation. Standard Seq2Seq models perform poorly, with BLEU scores below 1 and negligible ROUGE values. These results highlight the limited capability of classical encoder–decoder architectures to handle simplification in low-resource settings without external supervision or pretraining.
Zero-shot models show limited improvements, with Llama-3.3-70B achieving the highest BLEU (5.53) and SARI (40.02) in this category. However, length ratios remain high (3.17–5.38), indicating uncontrolled output length and overgeneration. Precision and recall are also lower than those of task-tuned models, confirming the limitations of zero-shot approaches in structure-sensitive tasks like simplification. Domain-specific models such as kazLLM and Sherkala produce more fluent and compact outputs, with BLEU scores around 20 and F1 scores exceeding 83. Among these, kazLLM-70B reaches the highest BLEU (21.52) and recall (86.65), with a length ratio of 1.06, indicating stronger alignment with reference length. KazSim outperforms all baselines across the board. The best configuration, KazSim based on Llama-3.3-70B, achieves the highest BLEU (33.5), SARI (56.38), and F1 (87.56), while maintaining a near-optimal length ratio of 0.98. Other KazSim variants also show strong performance, confirming the benefits of instruction tuning on task-specific data.
We further evaluate all models on a semi-manually created benchmark designed to reflect more natural simplification patterns. Zero-shot models continue to underperform: BLEU scores remain low, ranging from 0.29 to 4.53, and length ratios remain high, indicating persistent over-generation. Domain-specific models such as kazLLM-Llama-3.1-8B, kazLLM-Llama-3.1-70B, and Sherkala-8B demonstrate stable behavior, with BLEU scores between 16.35 and 17.09 and F1 scores exceeding 82. The best-performing configuration, KazSim based on Llama-3.3-70B, achieves a BLEU score of 20.33 and an F1 of 84.27 while maintaining a balanced length ratio of 0.99.
Overall, these results confirm that strong simplification performance in low-resource settings cannot be achieved through scale or domain adaptation alone. While general-purpose and domain-specific LLMs produce fluent outputs, they struggle with structure, length control, and simplification-specific alignment. In contrast, KazSim, fine-tuned with instruction-level supervision on training data, consistently yields better output quality across both evaluation settings.
We also evaluate the effect of prompt language by comparing performance under English and Kazakh instructions. While the differences are generally small, models show slightly better results when prompted in Kazakh. This trend is more visible for zero-shot and domain-adapted models, which benefit from alignment between instruction and output language. KazSim remained stable across both settings, confirming its robustness to prompt variation. These findings suggest that prompt formulation plays a role in model behavior, especially in multilingual setups where instruction language may influence generation quality.
Finally, we conducted a human evaluation with three native Kazakh speakers over 30 sentence pairs. KazSim achieved average ratings of 4.08 (fluency), 4.09 (meaning preservation), and 4.42 (simplicity), significantly outperforming GPT-4o-mini (3.60, 3.67, 3.80, respectively). Error analysis found that remaining failures cluster into three categories: tone change, tense change, and semantic drift—reflecting Kazakh’s agglutinative morphology, flexible syntax, and the lack of large-scale curated resources. Addressing these challenges via morphologically aware modeling, style- and tense-aware objectives, and expanded human-validated corpora will be critical next steps towards human-level Kazakh text simplification.
Conceptualization, A.T. and G.T.; methodology, A.T.; software, A.T. and G.T.; validation, A.T., G.T. and I.U.; formal analysis, A.T.; investigation, G.T.; resources, A.T. and G.T.; data curation, A.T. and G.T.; writing—original draft preparation, A.T.; writing—review and editing, A.T. and G.T.; visualization, A.T. and G.T.; supervision, A.T. and G.T.; project administration, A.T. and I.U.; funding acquisition, I.U. All authors have read and agreed to the published version of the manuscript.
The dataset of Kazakh complex–simple sentence pairs is released in this repository:
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 Pipeline for selecting complex Kazakh sentences from text corpora.
Figure 2 Training loss comparison between small and large Seq2Seq models.
Figure 3 Training loss trajectories of KazSim fine-tuned on three large language models.
Figure 4 SARI scores of various models on the two test sets.
Figure 5 Comparison of original complex sentences and simplified outputs generated by KazSim using English and Kazakh prompts.
Figure 6 Example output of KazSim for Kazakh sentence pairs with English translation.
Figure 7 Representative examples of the three main error categories.
Figure 8 Distribution of error types (tone change, tense change, semantic drift, and no error) across 163 sentence pairs.
Pearson correlation coefficients between key variables.
| Variable Pair | Correlation (r) | p-Value |
|---|---|---|
| Final score vs. morph score | 0.558 | < |
| Final score vs. log frequency |  | < |
| Morph score vs. log frequency |  | < |
Top 20 simplest and most frequent Kazakh words of the corpus. (Kazakh words are shown in Latin script for convenience).
| Word | Frequency | Morph Score | Score |
|---|---|---|---|
| da | 8309 | 1 | −8.025095 |
| dep | 6124 | 1 | −7.719971 |
| osy | 4618 | 1 | −7.437717 |
| edi | 3287 | 1 | −7.097731 |
| emes | 2436 | 1 | −6.798113 |
| de | 6485 | 2 | −6.777247 |
| eken | 2276 | 1 | −6.730175 |
| degen | 2263 | 1 | −6.724447 |
| endi | 2253 | 1 | −6.720018 |
| kop | 2197 | 1 | −6.694848 |
| gana | 2181 | 1 | −6.687539 |
| oz | 2163 | 1 | −6.679251 |
| bir | 5809 | 2 | −6.667164 |
| ozi | 1851 | 1 | −6.523481 |
| birak | 1808 | 1 | −6.499977 |
| ne | 1753 | 1 | −6.469084 |
| bul | 4413 | 2 | −6.392310 |
| goi | 1576 | 1 | −6.362645 |
| abai | 4106 | 2 | −6.320205 |
| bar | 4078 | 2 | −6.313362 |
Evaluation results of complexity classification models against human annotation.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Heuristic approach | 83.46% | 83.25% | 83.25% | 83.25% |
| GPT-4o-mini (high) | 79.20% | 75.22% | 86.29% | 80.38% |
Token and character-level statistics for complex and simple sentences in each dataset split. TTR—type token ratio.
| Dataset | avg_tokens | max_tokens | min_tokens | median_tokens | avg_chars | vocab_size | TTR |
|---|---|---|---|---|---|---|---|
| Train–Complex | 21.32 | 421 | 5 | 19 | 148.94 | 50,667 | 0.2728 |
| Train–Simple | 16.20 | 284 | 1 | 15 | 113.60 | 34,791 | 0.2466 |
| Test–Complex | 22.47 | 446 | 7 | 20 | 158.42 | 6749 | 0.6007 |
| Test–Simple | 16.90 | 238 | 1 | 15.5 | 119.28 | 4770 | 0.5646 |
Evaluation results on the Kazakh text simplification dataset. Bolded values indicate the highest performance across all models for each metric.
| Model | BLEU | ROUGE-1 | ROUGE-L | Length Ratio | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| Seq2Seq-small | 0.38 | 0.25% | 0.25% | 0.90 | 66.26 | 65.86 | 66.04 |
| Seq2Seq-large | 0.70 | 3.81% | 3.63% | 0.82 | 67.39 | 65.88 | 66.61 |
| kazLLM-Llama-3.1-8B | 20.72 | 42.09% | 40.57% | 1.22 | 82.52 | 84.51 | 83.45 |
| kazLLM-Llama-3.1-70B | 21.52 | 44.57% | 43.70% | 1.06 | 81.63 | 86.65 | 84.04 |
| Sherkala-Llama-3.1-8B | 19.59 | 42.35% | 40.83% | 1.17 | 82.58 | 85.25 | 83.85 |
| Llama-3.2-3B (zero-shot) | 0.28 | 0.56% | 0.57% | 5.38 | 57.57 | 60.47 | 58.91 |
| Qwen2-72B (zero-shot) | 1.30 | 3.08% | 2.97% | 4.27 | 59.71 | 63.39 | 61.28 |
| Llama-3.3-70B (zero-shot) | 5.53 | 23.47% | 22.47% | 3.17 | 70.12 | 77.13 | 73.24 |
| KazSim (Llama-3.2-3B) | 25.7 | 47.01% | 46.02% | 0.99 | 84.97 | 85.61 | 85.25 |
| KazSim (Qwen2-72B) | 27.8 | 48.56% | 47.60% | 0.96 | 75.20 | 81.14 | 78.03 |
| KazSim (Llama-3.3-70B) | 33.5 | 54.21% | 53.00% | 0.98 | 87.49 | 87.70 | 87.56 |
Evaluation results on the semi-manually created test set for Kazakh text simplification. Bolded values indicate the highest performance across all models for each metric.
| Model | BLEU | ROUGE-1 | ROUGE-L | Length Ratio | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| kazLLM-Llama-3.1-8B | 17.09 | 38.80% | 36.96% | 1.04 | 82.52 | 83.13 | 82.78 |
| kazLLM-Llama-3.1-70B | 16.35 | 41.24% | 39.59% | 1.25 | 81.84 | 85.76 | 83.72 |
| Sherkala-Llama-3.1-8B | 17.08 | 40.61% | 38.63% | 1.13 | 82.80 | 84.40 | 83.55 |
| Llama-3.2-3B (zero-shot) | 0.29 | 0.67% | 0.67% | 5.60 | 57.11 | 59.96 | 58.42 |
| Qwen2-72B (zero-shot) | 1.20 | 3.51% | 3.19% | 4.30 | 60.33 | 64.11 | 61.94 |
| Llama-3.3-70B (zero-shot) | 4.53 | 20.71% | 19.21% | 3.06 | 69.80 | 74.98 | 72.08 |
| KazSim (Llama-3.2-3B) | 17.82 | 39.89% | 38.49% | 1.01 | 83.72 | 83.11 | 83.37 |
| KazSim (Qwen2-72B) | 18.31 | 40.16% | 38.83% | 0.97 | 75.20 | 81.14 | 78.03 |
| KazSim (Llama-3.3-70B) | 20.33 | 42.26% | 40.50% | 0.99 | 84.76 | 83.87 | 84.27 |
The 95% confidence intervals (CI) and standard deviations (Std) for BLEU and SARI scores across different models on the automatically generated test set.
| Model | BLEU CI (Low) | BLEU CI (High) | BLEU Std | SARI CI (Low) | SARI CI (High) | SARI Std |
|---|---|---|---|---|---|---|
| kazLLM-Llama-3.1-8B | 19.09 | 22.36 | 0.83% | 44.87 | 46.77 | 0.49% |
| kazLLM-Llama-3.1-70B | 20.06 | 23.08 | 0.78% | 43.05 | 44.49 | 0.36% |
| Sherkala-Llama-3.1-8B | 17.73 | 21.48 | 0.91% | 43.72 | 45.56 | 0.47% |
| Llama-3.2-3B (zero-shot) | 0.15 | 0.43 | 0.07% | 32.94 | 34.87 | 0.50% |
| Qwen2-72B (zero-shot) | 1.07 | 1.56 | 0.12% | 34.64 | 36.54 | 0.51% |
| Llama-3.3-70B (zero-shot) | 4.93 | 6.27 | 0.34% | 39.22 | 40.89 | 0.43% |
| KazSim (Llama-3.2-3B) | 23.45 | 27.29 | 1.00% | 47.98 | 49.85 | 0.48% |
| KazSim (Qwen2-72B) | 25.54 | 29.92 | 1.15% | 51.70 | 53.87 | 0.55% |
| KazSim (Llama-3.3-70B) | 30.94 | 35.69 | 1.21% | 55.23 | 57.47 | 0.58% |
The 95% confidence intervals (CI) and standard deviations (Std) for BLEU and SARI scores on the semi-manually annotated test set.
| Model | BLEU CI (Low) | BLEU CI (High) | BLEU Std | SARI CI (Low) | SARI CI (High) | SARI Std |
|---|---|---|---|---|---|---|
| kazLLM-Llama-3.1-70B | 14.24 | 18.36 | 1.09% | 41.88 | 43.82 | 0.50% |
| kazLLM-Llama-3.1-8B | 14.58 | 19.48 | 1.28% | 43.87 | 46.89 | 0.80% |
| Sherkala-Llama-3.1-8B | 14.67 | 19.53 | 1.21% | 42.69 | 45.72 | 0.75% |
| Llama-3.2-3B (zero-shot) | 0.08 | 0.49 | 0.11% | 34.64 | 38.25 | 0.92% |
| Qwen2-72B (zero-shot) | 0.91 | 1.74 | 0.21% | 36.48 | 40.09 | 0.90% |
| Llama-3.3-70B (zero-shot) | 3.55 | 5.73 | 0.56% | 38.62 | 41.31 | 0.69% |
| KazSim (Llama-3.2-3B) | 15.18 | 20.13 | 1.26% | 43.19 | 45.76 | 0.69% |
| KazSim (Qwen2-72B) | 15.94 | 20.64 | 1.19% | 46.17 | 49.07 | 0.74% |
| KazSim (Llama-3.3-70B) | 17.80 | 22.47 | 1.18% | 46.91 | 50.05 | 0.80% |
Comparison of model performance under English and Kazakh instruction prompts. Metrics reported on the semi-manual test set include BLEU, SARI, and F1.
| Model | English BLEU | English SARI | English F1 | Kazakh BLEU | Kazakh SARI | Kazakh F1 |
|---|---|---|---|---|---|---|
| Sherkala-Llama-3.1-8B | 17.08 | 44.23 | 83.55 | 15.78 | 46.15 | 82.66 |
| kazLLM-Llama-3.1-70B | 16.34 | 42.85 | 83.72 | 16.22 | 42.30 | 83.62 |
| kazLLM-Llama-3.1-8B | 17.09 | 45.34 | 82.78 | 15.85 | 46.42 | 82.48 |
| Llama-3.3-70B (zero-shot) | 4.53 | 39.95 | 72.08 | 8.26 | 40.73 | 76.42 |
| KazSim (Llama-3.3-70B) | 20.32 | 48.42 | 84.27 | 21.03 | 48.78 | 84.39 |
Per-evaluator average ratings for KazSim’s simplified texts across fluency, meaning preservation, and simplicity.
| Evaluator | Fluency | Meaning Preservation | Simplicity |
|---|---|---|---|
| Evaluator 1 | 3.50 | 3.67 | 3.93 |
| Evaluator 2 | 4.43 | 4.27 | 4.77 |
| Evaluator 3 | 4.30 | 4.33 | 4.57 |
| Average | 4.08 | 4.09 | 4.42 |
Per-evaluator average ratings for GPT-4o-mini’s simplified texts across fluency, meaning preservation, and simplicity.
| Evaluator | Fluency | Meaning Preservation | Simplicity |
|---|---|---|---|
| Evaluator 1 | 3.77 | 3.67 | 3.70 |
| Evaluator 2 | 3.60 | 3.67 | 4.03 |
| Evaluator 3 | 3.43 | 3.67 | 3.67 |
| Average | 3.60 | 3.67 | 3.80 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).