Abstract

This study presents the first systematic evaluation of in-context learning for Tarifit machine translation, a low-resource Amazigh language spoken by 5 million people in Morocco and Europe. We assess three large language models (GPT-4, Claude-3.5, PaLM-2) on Tarifit–Arabic, Tarifit–French, and Tarifit–English translation using 1000 sentence pairs and 5-fold cross-validation. Results show that 8-shot similarity-based demonstration selection achieves optimal performance. GPT-4 achieved 20.2 BLEU for Tarifit–Arabic, 14.8 for Tarifit–French, and 10.9 for Tarifit–English. Linguistic proximity significantly impacts translation quality, with Tarifit–Arabic outperforming Tarifit–English by 8.4 BLEU points on average, owing to shared vocabulary and morphological patterns. Error analysis reveals systematic issues with morphological complexity (42% of errors) and cultural terminology preservation (18% of errors). This work establishes baseline benchmarks for Tarifit translation and demonstrates the viability of in-context learning for morphologically complex low-resource languages, contributing to linguistic equity in AI systems.


1. Introduction

1.1. Background

The vast majority of the world’s approximately 7000 languages are considered low-resource, lacking the extensive digital data and computational tools available for widely spoken languages [1]. This linguistic gap in natural language processing (NLP) research creates a significant technological divide, limiting digital inclusion and the preservation of linguistic diversity [2].

Among these underrepresented languages are the Amazigh (Berber) languages, spoken across North Africa by millions. Tarifit, a prominent Northern Berber variety, is spoken by approximately 5 million people, primarily in the Rif region of Morocco, and by a significant diaspora in various European countries [3]. Despite its substantial number of speakers and rich cultural heritage, Tarifit suffers from a profound lack of digital linguistic resources and computational tools [4,5].

1.2. Challenges and Motivations

Traditional neural machine translation approaches require large parallel corpora, which are virtually non-existent for Tarifit and most other low-resource languages. This limited resource landscape significantly impedes the development of robust NLP applications, such as machine translation (MT), for Tarifit. Recent work has explored transfer learning and multilingual models [6,7], but these approaches still require substantial training data and computational resources that remain inaccessible for most endangered languages.

The advent of Large Language Models (LLMs) has introduced a paradigm shift through in-context learning (ICL), where models can perform translation tasks using only a few demonstration examples provided in the prompt, without requiring parameter updates or extensive training data [8]. This capability offers unprecedented opportunities for low-resource machine translation, as it can leverage linguistic resources such as dictionaries, grammar books, and small sets of parallel examples that are often available for well-documented but digitally under-resourced languages like Tarifit.

1.3. Literature Survey

Recent studies have demonstrated the potential of in-context learning for machine translation of endangered languages. Zhang et al. [9] showed that incorporating linguistic descriptions in prompts significantly improves translation quality for critically endangered languages. Tanzer et al. [10] established benchmarks for learning translation from grammar books alone, while Hus and Anastasopoulos [11] explored the integration of grammatical knowledge in few-shot translation. Pei et al. [12] provided comprehensive analysis of in-context machine translation for low-resource languages through a case study on Manchu, demonstrating the effectiveness of ICL for morphologically complex languages.

For Tarifit specifically, El Ouahabi et al. [13] developed automatic speech recognition systems for Amazigh-Tarifit, while Boulal et al. [14] explored data augmentation techniques using convolutional neural networks for Amazigh speech recognition. However, systematic evaluation of ICL strategies for Berber languages, particularly regarding optimal shot selection, prompt engineering, and cross-lingual transfer patterns, remains unexplored.

1.4. Contributions and Novelties

This study addresses this critical gap by providing the first comprehensive investigation of in-context learning for Tarifit machine translation. We systematically evaluate the impact of various ICL components: shot selection strategies, prompt formulation techniques, and linguistic context integration across three language pairs (Tarifit–Arabic, Tarifit–French, Tarifit–English) that reflect the multilingual environment of Tarifit speakers. Through controlled experiments using state-of-the-art LLMs (GPT-4, Claude-3.5, PaLM-2) and a carefully curated dataset of 1000 sentences, we establish baseline performance metrics and identify optimal ICL configurations for this morphologically complex, low-resource language.

Our contributions include: (1) the first systematic evaluation of ICL for Tarifit translation, establishing baseline benchmarks across multiple language pairs; (2) comprehensive analysis of shot selection strategies and their impact on translation quality; (3) identification of optimal ICL configurations for morphologically rich, low-resource languages; and (4) practical recommendations for developing effective few-shot translation systems for endangered Berber languages. These findings advance our understanding of ICL capabilities for low-resource MT and provide a methodological framework for similar investigations in other underrepresented language families.

This study establishes baseline performance metrics for Tarifit ICL translation while acknowledging important methodological constraints. Our evaluation focuses on Latin script representation and unidirectional translation (Tarifit as source), with a dataset of 1000 sentences that, while substantial for low-resource language research, remains modest for comprehensive linguistic coverage. These limitations, detailed in Section 6, define the scope of our findings and highlight critical directions for future work in Berber language processing.

This study addresses three key research questions:

RQ1: What is the optimal number of demonstration examples (shot count) for Tarifit ICL translation across different target languages?

RQ2: How do different shot selection strategies (random, similarity-based, diversity-based) impact translation quality for morphologically complex low-resource languages?

RQ3: To what extent does linguistic proximity between Tarifit and the target languages (Arabic, French, English) influence ICL translation performance?

1.5. Organization of This Paper

The remainder of this paper is organized as follows: Section 2 reviews related work in Amazigh language processing and in-context learning for low-resource translation. Section 3 provides essential background on the Tarifit language, including its geographic distribution, writing systems, and computational challenges. Section 4 details our methodology, including the ICL framework, dataset construction, and evaluation protocols. Section 5 presents comprehensive experimental results. Finally, Section 6 discusses findings and implications with future research directions.

2. Related Work

The computational processing of Amazigh languages has seen a notable increase in scholarly attention in recent years, though research efforts remain distributed across various Amazigh varieties and natural language processing (NLP) tasks. This section reviews existing literature on Amazigh language processing, machine translation, and the emerging field of in-context learning for low-resource languages, with a particular focus on studies relevant to Tarifit.

2.1. Amazigh Language Processing and Machine Translation

Early computational work on Amazigh languages laid essential groundwork by addressing foundational NLP tasks. Outahajala et al. [15,16] established initial text processing methodologies, while Boulaknadel and Ataa Allah [17] developed standardized Amazigh corpora. A significant milestone was the creation of the first parallel multilingual corpus of Amazigh by Ataa Allah and Miftah [18], providing critical infrastructure for subsequent machine translation (MT) research.

Despite these foundational efforts, MT research for Amazigh languages remains relatively limited compared to other NLP domains. Taghbalout et al. [6] introduced a UNL-based approach for Moroccan Amazigh, albeit with restricted vocabulary. A more recent advancement was presented by Maarouf et al. [7], who developed the first transformer-based English-to-Amazigh translation system, demonstrating the potential of neural approaches, although their evaluation was conducted on small parallel corpora. Diab et al. [19] explored guided back-translation techniques for Kabyle–French, representing one of the few studies addressing Berber–European language pairs.

The inherent morphological complexity of Amazigh languages poses substantial challenges for MT systems. To address this, Nejme et al. [20,21] developed finite-state morphological analyzers, and Ammari and Zenkoua [22] contributed specialized work on pronominal morphology. These developments have been crucial in building the computational foundations necessary for advanced NLP applications for Amazigh.

2.2. LLM-Based In-Context Machine Translation

The emergence of large language models has revolutionized machine translation for low-resource languages through in-context learning capabilities. Brown et al. [8] first demonstrated that large language models could perform translation tasks using only a few demonstration examples in the prompt, without requiring parameter updates or fine-tuning.

Building on this foundation, several studies have specifically investigated ICL for low-resource and endangered languages. Lin et al. [23] explored few-shot learning with multilingual generative models, demonstrating effectiveness across various language pairs. Vilar et al. [24] provided systematic evaluation of prompting strategies for translation, establishing best practices for prompt engineering and shot selection.

Dictionary-based approaches have shown particular promise for low-resource ICL. Ghazvininejad et al. [25] introduced dictionary-based phrase-level prompting, showing significant improvements when lexical information is incorporated into prompts. Elsner and Needle [26] demonstrated effective translation of a low-resource language using GPT-3 with human-readable dictionaries, highlighting the potential of combining linguistic resources with few-shot learning.

Recent work has specifically targeted endangered and critically low-resource languages. Zhang et al. [9] showed that incorporating linguistic descriptions in prompts significantly improves LLM performance on endangered languages, while Zhang et al. [27] explored teaching LLMs unseen languages through in-context examples. Tanzer et al. [10] established benchmarks for learning translation from grammar books alone, providing systematic evaluation of grammatical knowledge integration in ICL.

Grammar-based approaches have yielded mixed results. Hus and Anastasopoulos [11] explored translation using grammar books, finding that while grammatical information can be helpful, its integration requires careful prompt engineering. Merx et al. [28] conducted a comprehensive study on Mambai, demonstrating the effectiveness of retrieval-augmented prompting for extremely low-resource languages.

However, systematic evaluation of ICL strategies specifically for Berber languages remains absent from the literature. Most existing work has focused on individual language cases without cross-linguistic analysis, and none has addressed the particular challenges posed by the morphological complexity and multilingual context characteristic of Tarifit and related Berber languages.

2.3. Tarifit-Specific Research and Digital Linguistic Landscape

Computational research specifically targeting Tarifit is notably sparse. Recent pioneering work includes Awar.ai [3], which introduced the first automatic speech recognition system designed specifically for Tarifit. Most relevant to the current study, Tahiri [29] conducted an in-depth analysis of word boundaries in Standard Amazigh writing through the lens of Tarifit Facebook users. Her findings highlighted significant orthographic variation, inconsistent digital writing practices, mixed script usage (Latin, Tifinagh, Arabic), and irregular tokenization patterns. These observations are directly pertinent to translation evaluation, as they underscore the complexity of real-world Tarifit text and the challenges it presents to automated systems.

The broader sociolinguistic context of Tarifit is also crucial for understanding its computational challenges. Aissati et al. [4] examined Amazigh language policy in Morocco, revealing the complex multilingual environment where Tarifit speakers frequently engage in code-switching between Tarifit, Arabic, and French. Ait Laaguid and Khaloufi [5] further documented similar multilingual practices in social media contexts, illustrating dynamic linguistic mixing that poses considerable challenges for automated translation systems.

These issues are further exacerbated by the broader challenge of linguistic underrepresentation in NLP. Joshi et al. [1] highlighted the critical underrepresentation of languages like Tarifit in mainstream NLP systems. Ataa Allah and Boulaknadel [30] surveyed emerging trends in less-resourced language processing, identifying key challenges specific to various Amazigh varieties.

Our current study addresses critical gaps within this research landscape by providing the first systematic evaluation of in-context learning for Tarifit translation, establishing baseline performance metrics and optimal ICL configurations for this underrepresented but culturally significant Berber language.

3. Tarifit Language Background

Tarifit (ISO 639-3: rif), also known as Northern Berber or Rifian, is a Berber language belonging to the Afroasiatic language family. The Tarifit language community represents a significant yet underserved linguistic group in the digital age. With 5 million speakers worldwide, including 3 million in Northern Morocco and major communities across Europe—notably in Belgium (700,000), Netherlands (600,000), France (300,000), and Spain (220,000)—this vibrant community lacks access to modern language technologies that many other languages take for granted [3].

3.1. Geographic Distribution and Multilingual Context

Tarifit is primarily spoken in the Rif region of Northern Morocco, encompassing provinces such as Al Hoceima, Nador, and parts of Taounate and Taza. However, the language extends far beyond Morocco’s borders through substantial diaspora communities established through decades of migration. As detailed in Table 1, the global distribution of Tarifit speakers creates a complex multilingual landscape with varying contact languages across different regions.

This geographic distribution creates a complex sociolinguistic landscape where Tarifit speakers regularly engage in code-switching between Tarifit, Arabic, and French in Morocco, or between Tarifit and European languages in diaspora communities. The multilingual competence of Tarifit speakers presents both opportunities and challenges for machine translation systems, as natural Tarifit discourse often contains lexical borrowings and code-mixed segments that automated systems must handle appropriately. The choice of Arabic, French, and English as target languages in our study directly reflects this multilingual reality, covering the primary contact languages across different Tarifit-speaking regions.

3.2. Writing Systems and Orthographic Variation

Tarifit can be written using multiple script systems: Tifinagh (the traditional Berber script), Latin script, Berber Latin script, or Arabic letters, reflecting the diverse literacy practices and historical influences within the community [3]. Table 2 illustrates this orthographic diversity through a simple greeting, demonstrating how the same linguistic content can appear in multiple written forms depending on the context and community practices.

Tifinagh represents the traditional indigenous script, increasingly promoted in educational and cultural contexts as part of Berber language revitalization efforts. The Latin-based orthographies are most common in digital contexts and educational materials, while Arabic script usage reflects the broader Arabic literacy in Morocco. This orthographic diversity poses significant challenges for computational processing, as the same linguistic content may appear in multiple scripts depending on the writer’s background, intended audience, and platform.

For the purposes of this study, we focus exclusively on Latin script representation of Tarifit, as it is the most prevalent form in digital communication and online resources from which our dataset is constructed. This methodological choice allows for consistent preprocessing and evaluation while avoiding the additional complexity of cross-script normalization, though we acknowledge that a comprehensive Tarifit NLP system would ultimately need to handle all script variants. The lack of standardized orthographic conventions across different Tarifit-speaking communities further complicates automated text processing, creating additional preprocessing challenges for machine translation systems, particularly when training data or evaluation metrics must account for multiple valid representations of the same linguistic content.

3.3. Linguistic Features and Computational Challenges

Tarifit exhibits the rich agglutinative morphology characteristic of Berber languages, with complex verbal inflection systems that mark person, number, gender, tense, aspect, and mood through prefixes, suffixes, and internal vowel alternations. This morphological complexity, combined with relatively free word order and extensive use of clitics, presents substantial challenges for automated parsing and translation. Traditional rule-based approaches struggle with the combinatorial complexity of morphological variations, while neural approaches require large training corpora that are unavailable for Tarifit.

The language demonstrates significant lexical borrowing from Arabic due to centuries of contact, creating cognates and shared vocabulary that may facilitate cross-lingual transfer in machine translation contexts, particularly for the Tarifit–Arabic language pair. However, this lexical overlap also introduces false friends and semantic shifts that can mislead automated systems. Additionally, the limited availability of digital corpora and standardized linguistic resources severely constrains the development of traditional data-driven NLP applications, making in-context learning approaches particularly valuable for this language community. The sparse digital presence and inconsistent writing practices documented in recent studies [29] further underscore the need for robust few-shot learning methodologies that can operate effectively with minimal training data.

Figure 1 illustrates the complex morphological structure characteristic of Tarifit through representative examples that demonstrate the agglutinative nature of the language and the computational challenges it presents for automated processing.

4. Methodology

4.1. In-Context Learning Framework

In-context learning enables large language models to perform translation tasks using only demonstration examples provided in the prompt, without parameter updates [8]. This paradigm is particularly promising for low-resource languages like Tarifit, where traditional neural machine translation approaches are hindered by the scarcity of parallel training data [7]. For a given Tarifit sentence x, the ICL translation process is formalized as:

(1)  \hat{y} = \mathrm{LLM}\big(\pi(D, x)\big)

where ŷ is the predicted translation, LLM(·) denotes the large language model, and π(D, x) is the prompt-construction function combining the demonstration set D = {(x_1, y_1), (x_2, y_2), …, (x_k, y_k)} with the input sentence x. Each (x_i, y_i) pair consists of a Tarifit sentence and its corresponding translation.

The prompt construction follows the standard ICL framework where the context C contains task instructions and demonstration examples. Complete prompt templates for all language pairs are provided in Appendix A.1.

(2)  C = \{\, I,\; s(x_1, y_1),\; s(x_2, y_2),\; \ldots,\; s(x_k, y_k),\; x \,\}

Here, I represents the task instructions, s(x_i, y_i) denotes a formatted demonstration example, and x is the input sentence to be translated. Figure 2 illustrates the complete ICL pipeline for Tarifit translation.
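For illustration, the prompt-construction function π(D, x) can be realized as a simple string template. The following Python sketch mirrors the templates in Appendix A.1; the helper name and exact wording are illustrative rather than the precise prompts used in the experiments.

```python
# Minimal sketch of the prompt-construction function pi(D, x) from Equations (1)-(2).
# The instruction wording and "Tarifit:/English:" labels follow Appendix A.1, but the
# exact templates used in the experiments may differ.

def build_prompt(demonstrations, source_sentence, target_language="English"):
    """Combine instructions I, formatted demonstrations s(x_i, y_i), and the input x."""
    instructions = f"Task: Translate the following Tarifit sentences to {target_language}."
    formatted = [f"Tarifit: {src}\n{target_language}: {tgt}" for src, tgt in demonstrations]
    return "\n\n".join(
        [instructions, "Examples:"]
        + formatted
        + ["Now translate:", f"Tarifit: {source_sentence}\n{target_language}:"]
    )

# Example with two demonstration pairs from Appendix A.1:
demos = [
    ("Azul, mlih cha?", "Hello, how are you?"),
    ("Ad yas qbar i thmadith", "He will come before noon"),
]
print(build_prompt(demos, "Tamghart ni tsawar tamazight"))
```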

4.2. Dataset Construction

We constructed a comprehensive dataset of 1000 Tarifit sentences with parallel translations in Arabic, French, and English. The dataset was carefully designed to capture the linguistic diversity and cultural richness of Tarifit across multiple domains and contexts (Table 3). The stratified sampling approach ensures representation of different linguistic phenomena, including morphological complexity, lexical borrowing from Arabic, and code-switching patterns characteristic of natural Tarifit discourse.

The dataset is available for research purposes upon request through our institutional ethics committee, with data collection methodology detailed in Appendix A.4.

All texts use Latin script representation following our methodological scope (Section 3.2). Reference translations were produced by three qualified native Tarifit speakers fluent in the target languages, emphasizing semantic accuracy and cultural appropriateness. Translation guidelines prioritized preserving cultural nuances and idiomatic expressions while maintaining natural target language fluency.

4.3. Shot Selection Strategies

The selection of appropriate demonstration examples is crucial for ICL performance. We evaluate three distinct shot selection approaches, each representing different strategies for optimizing the demonstration set composition.

4.3.1. Random Selection

Our baseline approach randomly samples k examples from the available parallel corpus:

(3)  D_{\mathrm{random}} = \mathrm{UniformSample}(P, k)

This strategy provides a control condition for evaluating the effectiveness of more sophisticated selection methods.
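As a minimal sketch, Equation (3) amounts to uniform sampling from the parallel corpus; the optional random seed below is an illustrative addition for reproducibility, not part of the formulation.

```python
import random

def random_selection(parallel_corpus, k, seed=None):
    """D_random = UniformSample(P, k): draw k (Tarifit, translation) pairs uniformly
    at random from the parallel corpus P (Equation (3))."""
    return random.Random(seed).sample(parallel_corpus, k)
```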

4.3.2. Similarity-Based Selection

This approach selects demonstrations most semantically similar to the input sentence using cosine similarity between multilingual sentence embeddings:

(4)  \mathrm{sim}(x, x_i) = \dfrac{\mathrm{emb}(x) \cdot \mathrm{emb}(x_i)}{\lVert \mathrm{emb}(x) \rVert \, \lVert \mathrm{emb}(x_i) \rVert}

This similarity measure computes how semantically related two sentences are by comparing their vector representations in a high-dimensional space. Higher values (closer to 1) indicate more similar meaning, while lower values (closer to 0) suggest different semantic content. This approach ensures that demonstration examples share semantic characteristics with the input sentence, potentially improving the model’s ability to recognize relevant translation patterns.

(5)  D_{\mathrm{sim}} = \underset{D \subseteq P,\, |D| = k}{\arg\max} \; \sum_{(x_i, y_i) \in D} \mathrm{sim}(x, x_i)

We employ multilingual sentence embeddings to compute semantic similarity, enabling effective cross-lingual demonstration selection. The complete algorithm implementation can be found in Algorithm A1 (Appendix A.5).
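The sketch below illustrates Equations (4) and (5) with an off-the-shelf multilingual sentence encoder; the specific model name is an assumption for illustration, as no particular embedding model is prescribed here.

```python
# Sketch of similarity-based selection (Equations (4)-(5)). The multilingual encoder
# name is an illustrative assumption; any sentence-embedding model could be substituted.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def similarity_selection(parallel_corpus, input_sentence, k):
    """Return the k demonstration pairs whose Tarifit side is most similar to the input."""
    sources = [src for src, _ in parallel_corpus]
    # L2-normalised embeddings turn the dot product into cosine similarity (Equation (4)).
    corpus_emb = encoder.encode(sources, normalize_embeddings=True)
    input_emb = encoder.encode([input_sentence], normalize_embeddings=True)[0]
    scores = corpus_emb @ input_emb
    top_k = np.argsort(-scores)[:k]                   # greedy top-k solves Equation (5)
    return [parallel_corpus[int(i)] for i in top_k]
```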

4.3.3. Diversity-Based Selection

This strategy ensures that demonstrations span different linguistic patterns while maintaining relevance to the input sentence, balancing similarity and coverage of morphological variations. The approach prioritizes examples that collectively cover diverse grammatical structures, vocabulary domains, and sentence lengths to provide comprehensive linguistic context for the model.
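Since Section 4.3.3 gives no closed-form objective, the following greedy, maximal-marginal-relevance-style routine is only one possible instantiation of the similarity/coverage trade-off; the weighting parameter and the reuse of the encoder from the previous sketch are assumptions.

```python
# One possible instantiation of diversity-based selection: greedily pick candidates that
# are relevant to the input yet not redundant with already selected demonstrations.
# The weight `lam` and the reuse of `encoder` from the previous sketch are assumptions.
import numpy as np

def diversity_selection(parallel_corpus, input_sentence, k, lam=0.7):
    sources = [src for src, _ in parallel_corpus]
    corpus_emb = encoder.encode(sources, normalize_embeddings=True)
    input_emb = encoder.encode([input_sentence], normalize_embeddings=True)[0]
    relevance = corpus_emb @ input_emb
    selected = []
    for _ in range(min(k, len(sources))):
        best, best_score = None, float("-inf")
        for i in range(len(sources)):
            if i in selected:
                continue
            # Penalise candidates that duplicate already-selected demonstrations.
            redundancy = max((float(corpus_emb[i] @ corpus_emb[j]) for j in selected), default=0.0)
            score = lam * float(relevance[i]) - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [parallel_corpus[i] for i in selected]
```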

4.4. Model Configuration

We evaluate three state-of-the-art large language models: GPT-4, Claude-3.5, and PaLM-2. These models span diverse architectures and parameter sizes, offering a representative comparison of modern LLM capabilities.

All models use temperature = 0 for deterministic outputs, ensuring reproducible results across experimental runs. We systematically vary the number of demonstration examples k ∈ {1, 3, 5, 8, 10, 15} to identify optimal shot counts for each target language and model combination. This range covers the spectrum from few-shot to many-shot learning scenarios within typical context window constraints.

Table 4 presents the technical specifications of the three LLMs selected for this study. These models were chosen to represent different architectural approaches and parameter scales, providing comprehensive coverage of current state-of-the-art capabilities for in-context learning tasks.
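A hedged sketch of the decoding configuration for the GPT-4 condition is shown below, using the OpenAI Python client; the model identifier and bookkeeping are illustrative, and the Claude-3.5 and PaLM-2 conditions would be queried analogously through their own APIs.

```python
# Hedged sketch of the GPT-4 decoding configuration (temperature = 0, max tokens = 1500)
# using the OpenAI Python client. The model identifier and bookkeeping are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def translate(prompt: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,       # deterministic outputs across experimental runs
        max_tokens=1500,
    )
    return response.choices[0].message.content.strip()

# Shot-count grid evaluated in Section 4.4.
SHOT_COUNTS = [1, 3, 5, 8, 10, 15]
```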

4.5. Evaluation Framework

4.5.1. Automatic Metrics

We employ multiple automatic evaluation metrics to capture different aspects of translation quality:

BLEU: Standard n-gram precision metric measuring lexical overlap:

(6)  \mathrm{BLEU} = \mathrm{BP} \times \exp\left( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \right)

chrF: Character-level F-score particularly suitable for morphologically rich languages:

(7)  \mathrm{chrF} = \dfrac{(1 + \beta^2) \times \mathrm{chrP} \times \mathrm{chrR}}{\beta^2 \times \mathrm{chrP} + \mathrm{chrR}}

BERTScore: Semantic similarity using contextual embeddings:

(8)  \mathrm{BERTScore} = \dfrac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} \mathrm{emb}(x_i) \cdot \mathrm{emb}(y_j)

The combination of these metrics provides complementary perspectives on translation quality, with BLEU capturing lexical fidelity, chrF addressing morphological variations, and BERTScore measuring semantic preservation.
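For reference, these metrics can be computed with standard open-source toolkits, as sketched below; the choice of sacrebleu and bert-score, and the target-language code passed to BERTScore, are implementation assumptions rather than prescriptions of this paper.

```python
# Sketch of the automatic evaluation in Section 4.5.1 using open-source toolkits
# (sacrebleu for BLEU/chrF, bert-score for BERTScore).
import sacrebleu
from bert_score import score as bert_score

def evaluate(hypotheses, references, target_lang="en"):
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    _, _, f1 = bert_score(hypotheses, references, lang=target_lang)
    return {"BLEU": bleu, "chrF": chrf, "BERTScore": 100 * f1.mean().item()}
```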

4.5.2. Human Evaluation

A subset of 200 translations undergoes evaluation by qualified native speakers using 5-point Likert scales for adequacy and fluency assessment. Comprehensive evaluation guidelines, criteria, and evaluator qualifications are specified in Appendix A.2. Inter-annotator agreement is measured using Krippendorff’s alpha to ensure evaluation reliability.

4.5.3. Cross-Validation Protocol

We employ 5-fold cross-validation, where 1000 sentences are partitioned into 5 folds of 200 sentences each. For each fold, 800 sentences serve for shot selection and 200 for evaluation:

(9)  \mathrm{Performance}(M, S, k, L) = \dfrac{1}{5} \sum_{f=1}^{5} \mathrm{Evaluate}\big(M,\, S(D_f, k),\, T_f,\, L\big)

where M is the model, S is the shot selection strategy, k is the number of shots, L is the target language, D_f is the demonstration set for fold f, and T_f is the test set for fold f. This protocol ensures robust performance estimates while maximizing the use of our limited dataset.
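A minimal sketch of this protocol is shown below, assuming scikit-learn's KFold for the 5-way partitioning; the callback names are illustrative placeholders.

```python
# Sketch of the 5-fold protocol in Equation (9): each fold holds out 200 sentences
# for evaluation while the remaining 800 serve as the demonstration pool.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(dataset, select_shots, translate_fn, evaluate_fn, k=8, seed=0):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(dataset):
        pool = [dataset[i] for i in train_idx]    # 800 pairs for shot selection (D_f)
        test = [dataset[i] for i in test_idx]     # 200 pairs for evaluation (T_f)
        hypotheses = [translate_fn(select_shots(pool, src, k), src) for src, _ in test]
        references = [tgt for _, tgt in test]
        fold_scores.append(evaluate_fn(hypotheses, references))
    # Average each metric across the five folds.
    return {m: float(np.mean([s[m] for s in fold_scores])) for m in fold_scores[0]}
```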

5. Results

This section presents the systematic evaluation of in-context learning performance for Tarifit translation across three target languages using multiple large language models. All experiments were conducted following the 5-fold cross-validation protocol described in Section 4, with results averaged across folds to ensure robustness.

5.1. Model Performance and Cross-Lingual Analysis

Table 5 presents the translation performance across all model and language combinations using our optimal configuration (8-shot similarity-based selection). The results reveal substantial performance variations both across models and target languages.

Cross-Lingual Performance Patterns: The results demonstrate clear performance hierarchies across target languages. GPT-4 achieved the highest scores for Tarifit→Arabic translation (BLEU: 20.2), followed by Tarifit→French (BLEU: 14.8) and Tarifit→English (BLEU: 10.9). This pattern aligns with our hypothesis regarding linguistic proximity effects, where extensive lexical borrowing and shared vocabulary between Tarifit and Arabic facilitate cross-lingual transfer.

Model-Specific Analysis: GPT-4 emerged as the strongest performer overall, achieving the best scores across all three language pairs. The performance gap between the best and worst-performing models was most pronounced for Tarifit→English, with a difference of 2.8 BLEU points. All models consistently followed the Arabic > French > English performance hierarchy.

Linguistic Proximity Effects: The Tarifit→Arabic language pair consistently outperformed other directions across all models, with an average improvement of 8.4 BLEU points over Tarifit→English. This advantage can be attributed to shared Semitic substrate influences, extensive Arabic lexical borrowing in Tarifit, and similar morphological patterns.

5.2. In-Context Learning Optimization

Figure 3 illustrates the relationship between shot count and translation performance across different target languages and models. Our analysis identifies k = 8 as the optimal number of demonstration examples across most model–language combinations. Performance improvements plateau beyond this point, with slight degradation observed at k = 15, likely due to context window limitations.

Table 6 compares shot selection approaches across all evaluation metrics. Similarity-based selection consistently achieved the highest performance across all language pairs and metrics, with an average improvement of 2.1 BLEU points, 3.3 chrF points, and 3.1 BERTScore points over random selection. The effectiveness of this strategy is evident across lexical (BLEU), morphological (chrF), and semantic (BERTScore) dimensions of translation quality.

Figure 4 provides a visual comparison of these shot selection strategies, clearly illustrating the consistent superiority of similarity-based selection across all three target languages, with Arabic maintaining the highest performance levels followed by French and English.

Bootstrap confidence intervals (95% CI) confirm that performance differences between optimal and suboptimal configurations are statistically significant (p < 0.05) for all model–language combinations (see Appendix A.6 for detailed statistical analysis procedures).

5.3. Error Analysis and Human Evaluation

Human evaluation was conducted on 200 translations (67 per target language) using the optimal configuration. Three qualified native Tarifit speakers evaluated translations using 5-point Likert scales (Table 7).

Inter-annotator agreement measured by Krippendorff’s alpha was α=0.69 for adequacy and α=0.65 for fluency, indicating substantial agreement. The correlation between automatic metrics and human judgments was strongest for chrF (r = 0.84), while BLEU showed weaker correlation (r = 0.72) (Figure 5). Representative examples of human evaluation outcomes across different translation quality levels are provided in Appendix A.3, illustrating the range of translation challenges encountered and the evaluation criteria applied.
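The reliability and correlation analyses can be reproduced with the sketch below; the krippendorff and scipy packages are implementation assumptions, as no specific libraries are mandated here.

```python
# Sketch of the reliability and correlation analyses in Section 5.3. Ratings form a
# matrix with one row per annotator and one column per translation, treated as ordinal
# Likert scores.
import krippendorff
from scipy.stats import pearsonr

def agreement_and_correlation(ratings, metric_scores, mean_human_scores):
    alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
    r, p_value = pearsonr(metric_scores, mean_human_scores)
    return alpha, r, p_value
```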

Systematic error analysis reveals distinct patterns across target languages (Figure 6):

Morphological Errors: 42% of errors involved incorrect handling of Tarifit’s agglutinative morphology, most frequent in English translations (48%).

Lexical Errors: 28% of errors involved lexical choices. Arabic translations showed fewer lexical errors (19%) compared to French (31%) and English (39%).

Cultural Errors: 18% of errors involved mistranslation of culture-specific terms, relatively consistent across languages (15–22%).

Code-Switching: 12% of errors occurred in code-switched segments. Arabic translations handled these most effectively (28% success rate vs. 8% for French and 6% for English).

The error analysis reveals that while current LLMs demonstrate promising capabilities for Tarifit translation, systematic challenges remain in morphological processing and cross-linguistic transfer, particularly for typologically distant target languages. The increased dataset size of 1000 sentences provides more robust error pattern identification and demonstrates the persistent challenges in automated processing of morphologically complex low-resource languages.

6. Discussion and Conclusions

6.1. Discussion

This study presents the first systematic evaluation of in-context learning for Tarifit machine translation, establishing baseline performance metrics and identifying optimal ICL configurations for this underrepresented Berber language. Our key contributions include: (1) demonstration that linguistic proximity significantly enhances ICL performance, with Arabic translations outperforming French and English by substantial margins; (2) identification of k = 8 as the optimal shot count and similarity-based selection as the most effective demonstration strategy; (3) comprehensive error analysis revealing systematic challenges in morphological processing and cultural preservation; and (4) validation of chrF as a more appropriate evaluation metric than BLEU for morphologically complex languages.

Our findings have immediate practical implications for developing translation tools for the 5 million Tarifit speakers worldwide, particularly in multilingual contexts where Arabic–Tarifit translation can serve as a bridge for accessing digital resources. The demonstrated effectiveness of ICL approaches, despite performance limitations, provides a viable path forward for developing NLP tools for other low-resource Berber languages, circumventing the data scarcity challenges that have historically limited computational linguistics research in this language family.

The realistic performance benchmarks established in this study (20.2 BLEU for Arabic, 14.8 for French, 10.9 for English using GPT-4) provide a foundation for future research and set appropriate expectations for practical deployment. While these scores indicate that significant challenges remain, particularly in morphological processing and cross-linguistic transfer, they represent substantial progress for a language with virtually no prior computational resources.

6.2. Societal and Educational Implications

The development of effective machine translation for Tarifit has significant implications beyond computational linguistics. For the 5 million Tarifit speakers worldwide, access to translation technology can bridge communication gaps in multilingual contexts, particularly for diaspora communities maintaining connections with their linguistic heritage.

In educational settings, these tools can support Tarifit language preservation efforts by enabling content translation for educational materials and facilitating bilingual education programs. The demonstrated effectiveness of ICL approaches suggests viable pathways for developing similar technologies for other endangered Berber languages, contributing to broader linguistic diversity preservation in the digital age.

Furthermore, the accessibility of ICL methods—requiring minimal computational resources compared to traditional neural machine translation—makes this technology more democratically available to language communities that lack extensive technical infrastructure. This democratization of language technology represents a step toward more equitable representation in artificial intelligence systems. These findings align with patterns observed in ICL research for other morphologically complex languages like Manchu [12] while revealing the unique influence of historical language contact patterns in Berber language processing, where Arabic proximity effects (8.4 BLEU advantage) exceed typical related-language improvements in ICL studies.

6.3. Future Work

Several research directions emerge from our findings. First, extending our methodology to other Berber languages (Tamazight, Tashelhiyt, Kabyle) would validate the generalizability of our ICL optimization strategies and linguistic proximity effects. Second, investigating bidirectional translation capabilities, particularly Arabic→Tarifit, could provide insights into the asymmetries of cross-lingual transfer. Third, developing multilingual ICL approaches that simultaneously leverage multiple target languages could improve overall translation quality through cross-lingual knowledge sharing.

Technical extensions should explore the integration of morphological analyzers and cultural knowledge bases into ICL prompts, potentially addressing the systematic errors identified in our analysis. Additionally, investigating prompt engineering techniques specifically designed for morphologically rich languages could further improve performance. Finally, comprehensive comparison with fine-tuning approaches, when sufficient computational resources permit, would establish the relative merits of ICL versus traditional neural machine translation methods for low-resource languages.

Practical applications should focus on developing user-friendly interfaces that appropriately communicate translation confidence levels and limitations to end users. Given the performance constraints identified, hybrid approaches combining ICL with human post-editing or community-driven correction mechanisms may prove most effective for real-world deployment.

Our work contributes to the broader goal of linguistic equity in artificial intelligence systems, demonstrating that state-of-the-art language models can be effectively adapted for underrepresented languages through carefully designed in-context learning approaches, albeit with realistic performance expectations. As the field moves toward more inclusive NLP technologies, our methodology provides a template for developing translation capabilities for the thousands of low-resource languages that lack sufficient data for traditional machine translation approaches.

Author Contributions

Conceptualization, O.A. and K.F.; methodology, O.A.; investigation, O.A.; resources, O.A.; data curation, O.A.; writing—original draft preparation, O.A.; writing—review and editing, K.F.; visualization, O.A.; supervision, K.F.; project administration, O.A. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Data are available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to the Data Availability Statement. This change does not affect the scientific content of the article.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Tarifit morphological structure examples. The diagram illustrates key morphological patterns in Tarifit: (a) circumfix negation wa…cha surrounding the verb phrase in “he will not come”, (b) complex negation with locative and existential verb in “why doesn’t he exist here?” (c) feminine circumfix ta-…-t surrounding the root in “woman”, and (d) interrogative construction in “what time?” Color coding indicates grammatical markers (red), root words (green), verbs/nouns (blue), and question words (yellow). These circumfix patterns demonstrate the computational challenges posed by Tarifit’s discontinuous morphemes for machine translation systems.

Figure 2 In-context learning framework for Tarifit translation. The diagram shows the six-step process: (1) Tarifit input sentence, (2) shot selection strategy with corpus and target language information, (3) demonstration example selection, (4) prompt construction, (5) LLM processing, and (6) translation output. The evaluation framework employs automatic metrics (BLEU, chrF, BERTScore), human evaluation, and 5-fold cross-validation.

Figure 3 Translation Performance vs. Shot Count Across Models and Languages. The graphs demonstrate that performance consistently improves from 1-shot to 8-shot configurations across all model–language combinations, with diminishing returns beyond this point. Arabic translations (blue lines) show the steepest improvement curves and highest peak performance, while English translations (red lines) exhibit more modest gains. The plateau effect after 8 shots suggests optimal context utilization, with slight degradation at 15 shots indicating context window limitations. GPT-4 (solid lines) consistently outperforms other models, maintaining larger performance gaps for typologically distant language pairs.

Figure 4 Shot selection strategy performance.

Figure 5 Automatic vs. human evaluation correlation.

Figure 6 Error type distribution by language.

Table 1. Tarifit speaker distribution by region.

Region/Country Speakers Primary Contact Languages
Northern Morocco 3,000,000 Arabic, French
Belgium 700,000 Dutch, French
Netherlands 600,000 Dutch
France 300,000 French
Spain 220,000 Spanish, Catalan
Other Europe 180,000 Various
Total 5,000,000 -

Table 2. Example of Tarifit orthographic variation.

Script Text Usage Context
Tifinagh [Image omitted. Please see PDF.] Cultural, digital, academic, and formal contexts
Latin Azul, mlih cha?
Berber Latin Aẓul, mliḥ ca?
Arabic [Image omitted. Please see PDF.]
Translation: “Hello, how are you?”
Additional Examples in Context:
Latin Yossid qbar i thmadith He came before noon
Latin Tamghart ni thsawar tamazight That woman speaks Amazigh
Latin Wanin bo awar ni They don’t say that word

Table 3. Dataset composition.

Domain Sentences Avg. Length Source
Conversational 500 8.2 Social media, interviews
Literary 330 12.4 Traditional stories, poetry
Cultural 170 15.1 Proverbs, oral traditions
Total 1000 10.3 -

Table 4. Large language model specifications and access details.

Specification GPT-4 Claude-3.5 PaLM-2
Parameters ∼1.7 T ∼200 B 540 B
Context Window 8192 tokens 200 K tokens 8192 tokens
Access Method OpenAI API (Paid) Anthropic API (Paid) Google API (Free tier)
Temperature 0 0 0
Max Tokens 1500 1500 1500
API Rate Limit 10K RPM 5K RPM 1K RPM

Table 5. Translation performance across models and target languages. Results show clear performance hierarchies: (1) Arabic consistently outperforms French and English due to linguistic proximity and shared vocabulary, (2) GPT-4 achieves superior performance across all language pairs, with the largest advantage for distant languages, and (3) all models follow the same ranking pattern (Arabic > French > English), indicating robust cross-model linguistic proximity effects. BLEU scores represent lexical overlap, chrF captures morphological accuracy, and BERTScore measures semantic preservation.

Model Tarifit→Arabic Tarifit→French Tarifit→English
BLEU chrF BERT BLEU chrF BERT BLEU chrF BERT
GPT-4 20.2 38.7 69.4 14.8 32.1 61.2 10.9 27.8 56.8
Claude-3.5 18.6 36.3 67.1 13.1 29.6 58.9 9.4 25.2 54.3
PaLM-2 16.9 33.8 64.2 11.7 27.4 56.1 8.1 23.1 51.7

Table 6. Shot selection strategy performance across all metrics.

Strategy Arabic French English
BLEU chrF BERT BLEU chrF BERT BLEU chrF BERT
Random 17.4 34.2 65.1 11.9 28.7 57.8 7.8 24.1 52.3
Similarity 19.7 37.5 68.2 13.6 31.4 60.5 9.5 26.8 55.1
Diversity 18.8 36.1 66.9 12.7 29.9 59.2 8.9 25.4 53.7

Table 7. Human evaluation results.

Language Adequacy Fluency BLEU
Arabic 3.4 ± 0.7 3.6 ± 0.6 20.2
French 2.8 ± 0.8 3.0 ± 0.7 14.8
English 2.5 ± 0.9 2.7 ± 0.8 10.9

Appendix A. Experimental Reproducibility Details

Appendix A.1. Sample ICL Prompt Templates

 Tarifit-to-English Translation Prompt: 

Task: Translate the following Tarifit sentences to English.

Examples:

Tarifit: Azul, mlih cha?

English: Hello, how are you?

Tarifit: Ad yas qbar i thmadith

English: He will come before noon

Tarifit: Tamghart ni tsawar tamazight

English: That woman speaks Amazigh

[Additional examples selected based on similarity strategy…]

Now translate:

Tarifit: [INPUT SENTENCE]

English:

 Tarifit-to-French Translation Prompt: 

Task: Translate the following Tarifit sentences to French.

Examples:

Tarifit: Azul, mlih cha?

French: Bonjour comment allez-vous?

Tarifit: Ad yas qbar i thmadith

French: Il viendra avant midi

[Additional examples…]

Now translate:

Tarifit: [INPUT SENTENCE]

French:

Appendix A.2. Human Evaluation Protocol

Evaluator Qualifications:

Native Tarifit speakers

Fluent in target languages (Arabic/French/English)

Linguistic or translation background preferred

Evaluation Criteria:

Adequacy (1–5): Does the translation convey the meaning of the source text?

5: Complete meaning preserved

4: Most meaning preserved, minor gaps

3: Essential meaning preserved

2: Some meaning preserved

1: Little or no meaning preserved

Fluency (1–5): Is the translation natural in the target language?

5: Perfect fluency

4: Good fluency, minor issues

3: Acceptable fluency

2: Disfluent but understandable

1: Very disfluent

Appendix A.3. Sample Human Evaluation Examples

Example 1—High Quality Translation:

Tarifit Source: Azul, mlih cha?

GPT-4 Translation: Hello, how are you?

Reference Translation: Hello, how are you?

Evaluator Scores: Adequacy: 5/5, Fluency: 5/5

Comments: Perfect translation preserving greeting convention

Example 2—Good Translation with Minor Issues:

Tarifit Source: Tamghart ni tsawar tamazight

GPT-4 Translation: That woman speaks Amazigh

Reference Translation: That woman speaks Berber

Evaluator Scores: Adequacy: 4/5, Fluency: 5/5

Comments: Accurate but uses “Amazigh” instead of more common “Berber”

Example 3—Translation with Morphological Error:

Tarifit Source: Netta wa ditis cha

GPT-4 Translation: He will not coming

Reference Translation: He will not come

Evaluator Scores: Adequacy: 3/5, Fluency: 2/5

Comments: Meaning preserved but grammatical error in English

Example 4—Cultural Context Challenge:

Tarifit Source: Chha tsa3at?

GPT-4 Translation: How many hours?

Reference Translation: What time is it?

Evaluator Scores: Adequacy: 2/5, Fluency: 4/5

Comments: Literal translation misses idiomatic time-asking expression

Appendix A.4. Data Collection Methodology

Source Selection: Sentences collected from social media posts, traditional stories, and conversational recordings with speaker consent

Translation Process: Each sentence translated independently by three qualified native speakers

Quality Control: Disagreements resolved through consensus discussion

Cultural Sensitivity: All materials reviewed for cultural appropriateness before inclusion

Appendix A.5. Shot Selection Algorithm

Algorithm A1: Similarity-Based Shot Selection

Input: sentence x, corpus C, shot count k

Output: selected demonstrations D

Compute embedding emb(x) for input sentence

for each sentence siC do

   Compute similarity sim(x, s_i) = (emb(x) · emb(s_i)) / (‖emb(x)‖ · ‖emb(s_i)‖)

end for

Sort sentences by similarity score (descending)

Select top k sentences as demonstrations D

return D

Appendix A.6. Statistical Analysis Details

Significance Testing: Paired t-tests for performance comparisons

Confidence Intervals: Bootstrap sampling with 1000 iterations

Effect Size: Cohen’s d for practical significance assessment

Multiple Comparisons: Bonferroni correction applied where appropriate
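A minimal sketch of these procedures is given below, assuming sentence-level score vectors as the resampling and pairing unit (the exact unit is not specified above).

```python
# Sketch of the statistical analysis recipe: paired t-tests, bootstrap confidence
# intervals with 1000 resamples, and Cohen's d on paired differences.
import numpy as np
from scipy.stats import ttest_rel

def paired_t_test(scores_a, scores_b):
    return ttest_rel(scores_a, scores_b)              # paired t-test between configurations

def bootstrap_ci(scores, n_iter=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(n_iter)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def cohens_d(scores_a, scores_b):
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return diff.mean() / diff.std(ddof=1)             # effect size on paired differences
```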

1. Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv; 2020; arXiv: 2004.09095

2. Galla, C.K. Indigenous language revitalization, promotion, and education: Function of digital technology. Comput. Assist. Lang. Learn.; 2016; 29, pp. 1137-1151. [DOI: https://dx.doi.org/10.1080/09588221.2016.1166137]

3. Awar.ai: First Speech Recognition for Tarifit. Available online: https://awar.ai (accessed on 5 May 2025).

4. Aissati, A.E.; Karsmakers, S.; Kurvers, J. ‘We are all beginners’: Amazigh in language policy and educational practice in Morocco. Comp. A J. Comp. Int. Educ.; 2011; 41, pp. 211-227. [DOI: https://dx.doi.org/10.1080/03057925.2011.547289]

5. Ait Laaguid, B.; Khaloufi, A. Amazigh language use on social media: An exploratory study. J. Arbitrer; 2023; 10, pp. 24-34. [DOI: https://dx.doi.org/10.25077/ar.10.1.24-34.2023]

6. Taghbalout, I.; Allah, F.A.; Marraki, M.E. Towards UNL-based machine translation for Moroccan Amazigh language. Int. J. Comput. Sci. Eng.; 2018; 17, pp. 43-54. [DOI: https://dx.doi.org/10.1504/IJCSE.2018.094418]

7. Maarouf, O.; Maarouf, A.; El Ayachi, R.; Biniz, M. Automatic translation from English to Amazigh using transformer learning. Indones. J. Electr. Eng. Comput. Sci.; 2024; 34, pp. 1924-1934. [DOI: https://dx.doi.org/10.11591/ijeecs.v34.i3.pp1924-1934]

8. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 1877-1901.

9. Zhang, K.; Choi, Y.; Song, Z.; He, T.; Wang, W.Y.; Li, L. Hire a linguist!: Learning endangered languages in LLMs with in-context linguistic descriptions. Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 15654-15669.

10. Tanzer, G.; Suzgun, M.; Visser, E.; Jurafsky, D.; Melas-Kyriazi, L. A benchmark for learning to translate a new language from one grammar book. Proceedings of the Twelfth International Conference on Learning Representations; Vienna, Austria, 7–11 May 2024.

11. Hus, J.; Anastasopoulos, A. Back to school: Translation using grammar books. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Miami, FL, USA, 12–16 November 2024; pp. 20207-20219.

12. Pei, R.; Liu, Y.; Lin, P.; Yvon, F.; Schütze, H. Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu. arXiv; 2025; arXiv: 2502.11862

13. El Ouahabi, S.; Atounti, M.; Bellouki, M. Toward an automatic speech recognition system for amazigh-tarifit language. Int. J. Speech Technol.; 2019; 22, pp. 421-432. [DOI: https://dx.doi.org/10.1007/s10772-019-09617-6]

14. Boulal, H.; Bouroumane, F.; Hamidi, M.; Barkani, J.; Abarkan, M. Exploring data augmentation for Amazigh speech recognition with convolutional neural networks. Int. J. Speech Technol.; 2024; 28, pp. 53-65. [DOI: https://dx.doi.org/10.1007/s10772-024-10164-y]

15. Outahajala, M.; Zenkouar, L.; Rosso, P.; Martí, A. Tagging amazigh with ancorapipe. Proceedings of the Workshop on Language Resources and Human Language Technology for Semitic Languages; Valletta, Malta, 26 January 2010; pp. 52-56.

16. Outahajala, M. Processing Amazighe Language. Natural Language Processing and Information Systems, Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011, Alicante, Spain, 28–30 June 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 313-317.

17. Boulaknadel, S.; Ataa Allah, F. Building a standard Amazigh corpus. Proceedings of the Third International Conference on Intelligent Human Computer Interaction (IHCI 2011); Prague, Czech Republic, 29–31 August 2011; Springer: Berlin/Heidelberg, Germany, 2012; pp. 91-98.

18. Allah, F.A.; Miftah, N. The First Parallel Multi-lingual Corpus of Amazigh. J. Eng. Res. Appl.; 2018; 8, pp. 5-12.

19. Diab, N.; Sadat, F.; Semmar, N. Towards Guided Back-translation for Low-resource languages—A Case Study on Kabyle-French. Proceedings of the 2024 16th International Conference on Human System Interaction (HSI); Paris, France, 8–11 July 2024; pp. 1-4.

20. Nejme, F.Z.; Boulaknadel, S.; Aboutajdine, D. Finite state morphology for Amazigh language. Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics; Samos, Greece, 24–30 March 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 189-200.

21. Nejme, F.Z.; Boulaknadel, S.; Aboutajdine, D. AmAMorph: Finite state morphological analyzer for amazighe. J. Comput. Inf. Technol.; 2016; 24, pp. 91-110. [DOI: https://dx.doi.org/10.20532/cit.2016.1002478]

22. Ammari, R.; Zenkoua, A. APMorph: Finite-state transducer for Amazigh pronominal morphology. Int. J. Electr. Comput. Eng.; 2021; 11, 699. [DOI: https://dx.doi.org/10.11591/ijece.v11i1.pp699-706]

23. Lin, X.V.; Mihaylov, T.; Artetxe, M.; Wang, T.; Chen, S.; Simig, D.; Ott, M.; Goyal, N.; Bhosale, S.; Du, J.; et al. Few-shot learning with multilingual generative language models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9019-9052.

24. Vilar, D.; Freitag, M.; Cherry, C.; Luo, J.; Ratnakar, V.; Foster, G. Prompting PaLM for translation: Assessing strategies and performance. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Toronto, ON, Canada, 9–14 July 2023; pp. 15406-15427.

25. Ghazvininejad, M.; Gonen, H.; Zettlemoyer, L. Dictionary-based phrase-level prompting of large language models for machine translation. arXiv; 2023; arXiv: 2302.07856

26. Elsner, M.; Needle, J. Translating a low-resource language using GPT-3 and a human-readable dictionary. Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology; Toronto, ON, Canada, 14 July 2023; pp. 1-13.

27. Zhang, C.; Liu, X.; Lin, J.; Feng, Y. Teaching large language models an unseen language on the fly. Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 8783-8800.

28. Merx, R.; Mahmudi, A.; Langford, K.; de Araujo, L.A.; Vylomova, E. Low-resource machine translation through retrieval-augmented LLM prompting: A study on the Mambai language. Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024; Turin, Italy, 20–24 May 2024; pp. 1-11.

29. Tahiri, N. Word Boundaries in the Writing System of Standard Amazigh: Challenges from Tarifit Facebook Users. The Handbook of Berber Linguistics; Springer: Berlin/Heidelberg, Germany, 2024; pp. 229-253.

30. Ataa Allah, F.; Boulaknadel, S. New trends in less-resourced language processing: Case of Amazigh language. Int. J. Nat. Lang. Comput. (IJNLC); 2023; 12.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).