Investigating Offensive Language Detection in a

1. Introduction

Natural language processing (NLP) and machine learning (ML) research can play an important role in moderating content by automatically identifying hate speech and offensive language on social network platforms. This, in turn, facilitates the implementation of robust automatic content moderation strategies. Social media platforms make use of artificial intelligence (AI), including various ML classifiers such as random forests and naive Bayesian networks, to detect and remove offensive and demeaning language [1]. However, unlike languages equipped with abundant linguistic resources, constructing robust models capable of accurately distinguishing offensive from non-offensive text in the Moroccan Arabic dialect called Darija (MD) poses a challenge due to the scarcity of linguistic assets, primarily labeled datasets. This challenge is further amplified by the distinctive nature of Darija, which is a low-resource language that is also mostly spoken, informal, and code-switched. Darija incorporates lexical elements from diverse languages, including Berber, French, and Spanish, while its grammar and orthography lack standardized rules [2].

The linguistics and sociolinguistics of MD reflect the unique blend of languages and cultures in Morocco. Morocco has a long history as a place where various cultures have interacted and coexisted, due to its location at the intersection of Africa, Europe, and the Islamic world [3]. As a result, Darija includes a significant number of loanwords from Berber, French, and Spanish, which affect its lexicon and phonetic structure. These borrowed elements can alter the perception and identification of offensive language. Understanding these phonological and morphological features is crucial for developing accurate detection algorithms.

Written Darija does not have an official version and is not taught in any formal educational programs. The syntax of Darija often deviates from Modern Standard Arabic (MSA), with variations in sentence structure and verb conjugation [4]. MD often allows more flexible word order compared to the more rigid structure of MSA. MD also differs from MSA in its use of prefixes and suffixes for verb conjugation. In Darija, tense is often marked with affixes such as the prefix “kan-” for the present tense and the suffix “-t” for the past tense, compared to the more complex system in MSA, which uses distinct prefixes and suffixes for each pronoun. For instance, the verb “to write” in MSA would be “kataba” (he wrote) and “yaktubu” (he writes), while in Darija it becomes “ktb” (he wrote) and “ka-yktb” (he writes). Darija also tends to drop case endings and uses shorter, more colloquial forms than MSA. Negation in Darija also relies on a different sentence structure and on suffixes. In the example “ana ma fhamtch” (I did not understand), MD combines the particle “ma” with the suffix “-ch”, attached after the past-tense ending “-t”, to indicate negation.

The use of Latin script (Arabizi or Franco-Arabic) is common online for writing Darija due to the accessibility of Latin keyboards and the convenience of typing in informal settings. Numerals are used to represent specific Arabic sounds, such as “3” for “ع” (ayn) and “7” for “ح” (ḥā). Meanwhile, Arabic script remains prevalent in more formal or traditional online contexts, highlighting the dual nature of script use in Morocco. However, there are no standard writing rules for MD.
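As a minimal illustration of the numeral convention described above, the following Python sketch maps the two digits mentioned in the text to their Arabic letters. The names `ARABIZI_DIGITS` and `replace_arabizi_digits` are hypothetical, and the table is deliberately limited to the two mappings given here; a practical transliterator would need many more rules and would have to distinguish digits used as letters from genuine numbers.

```python
# Only the two mappings mentioned in the text; real Arabizi uses more conventions.
ARABIZI_DIGITS = {"3": "ع", "7": "ح"}


def replace_arabizi_digits(text: str) -> str:
    """Replace Arabizi digits with the Arabic letters they stand for.

    Assumption: every "3" or "7" is a letter substitute, which is not
    true for genuine numbers such as dates or quantities.
    """
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(replace_arabizi_digits("3 7"))
```

Because written Darija has no standard spelling, such a table can only approximate what individual users actually type.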

Sociolinguistically, MD is a language spoken by nearly 37 million Moroccans. MD represents more than just a language; it acts as a cultural identifier, especially in urban areas. In contrast, Berber and French dominate rural and formal sectors. Diglossia is prominent, with Moroccans switching between Darija for informal settings and MSA or French in formal and educational contexts. Darija varies significantly across different regions of Morocco, with distinct local dialects and slang [5]. These regional variations affect how offensive language is used and perceived. The usage of offensive language in Darija can be influenced by social factors such as age, gender, and social status. For instance, what might be considered offensive in one social group may not be perceived the same way in another.

Detecting offensive language is a significant challenge. It goes beyond simply identifying bad words; it requires understanding the context, grasping linguistic nuances, recognizing sentiment and tone, and detecting irony, as well as implicit offensive language. Moreover, the annotated data might suffer from bias because of the personal views of the annotators, as determining what is considered offensive can vary based on cultural and social norms.

In a typical machine learning process, held-out accuracy has traditionally been the primary method for assessing generalization. However, it often overestimates the performance of NLP models because the held-out test set typically shares the same biases as the training data. Hence, there is a need for an alternative way of testing the generalization capabilities of the model when confronted with different linguistic data. Moreover, in production systems, the changing behavior of online users over time might cause a data distribution shift, leading to a drop in the performance of the ML models. Hence, offensive language detection models should be tested thoroughly to check if they work as intended. Testing ML models can be more complex than testing traditional software [6]. First, there is the data dependency problem: the results of a model can vary based on the input data and the specific circumstances of its usage [7]. Second, the black-box nature inherent to ML models makes it difficult to understand or explain their internal workings. This complexity impedes the identification of the root cause of model weaknesses and hinders the development of the right tests to expose them [8]. Third, the performance of ML models usually degrades over time; they can become stale if they are not part of a continual learning pipeline that is able to adapt to new data distributions, making it difficult to evaluate their long-term reliability. Finally, limited test coverage is a challenge, as it is impractical to test a machine learning model on all possible input data [9].

Moreover, unlike resource-rich languages such as English or German, it is difficult to generate testing data for low-resource languages because of the limited availability of tools, such as Part-Of-Speech (POS) taggers, word vectors or embeddings, and Named Entity Recognition (NER) models.

MD is the most spoken dialect in Morocco, and its use in social media is continually increasing. According to Kepios analysis, the number of social media users is increasing and was estimated in 2022 at 63.4% of the total population. Therefore, the regulation of social media content in this low-resource language is a pressing matter, and no language should be left behind. In response to these challenges, this work aims to fill the gap in the literature and address the lack of research on Darija NLP. We present three key contributions that advance this under-explored area. First, we introduce a human-labeled dataset. Second, we fine-tune various language models on the created dataset. Third, we test the best model by assessing its correctness, robustness, and fairness in a black-box manner using metamorphic testing (MT) and adversarial data. In the context of ML-based systems, correctness represents the ability of a system to produce correct results when provided with valid inputs [10]. Robustness describes the model’s capacity to make correct predictions when provided with unexpected or abnormal inputs [11]. Fairness, on the other hand, represents the ability to make unbiased predictions, without discrimination based on sensitive attributes such as race, gender, religion, or age [12].

The rest of this article is organized as follows. Section 2 reviews state-of-the-art research in the field of offensive language detection and machine learning testing, with a special focus on testing NLP models. Section 3 presents relevant materials and methods and delves into the details of the dataset, the ML models and the testing experiments, providing insights on the process of completing this work. Section 4 discusses and analyzes the results of the research. Finally, in Section 5, we outline some conclusions and directions for future work.

2. Related Work

In this section, we examine the related work in two critical areas. First, we review methodologies and advancements in offensive language and hate speech detection research, including recent developments in the context of Arabic. Second, we evaluate techniques for testing ML systems beyond offline metrics, with a particular emphasis on testing NLP models.

2.1. State-of-the-Art Models in Offensive Language Detection

Recent research has highlighted the effectiveness of various deep learning approaches for detecting offensive language and hate speech. In this context, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks, and attention-based models like Bidirectional Gated Recurrent Units with attention layers (Bi-GRU-ATTs) have emerged as leading methodologies [13,14]. These deep learning approaches were reported in the literature as outperforming traditional ML classifiers such as Logistic Regression, Gradient Boosting Decision Trees, and Support Vector Machines by 13–20% [15]. This notable improvement is attributed to their inherent capacity to capture complex linguistic patterns and contextual information.

Comparing the relative performance of deep learning architectures such as CNNs, LSTMs, and GRUs prior to the emergence of transformers and attention mechanisms in the task of detecting offensive language and hate speech reveals mixed results regarding which approach is the most effective. However, many case studies report that a combination of approaches such as CNNs with an LSTM or CNNs with a GRU can lead to improved results compared to using individual classifiers [16,17]. Typically, the latter approach leverages the strengths of CNNs for capturing word-level features in text, while LSTMs and GRUs perform well at modeling dependencies in sequence data. This trend highlights the importance of experimenting with hybrid deep learning models to optimize performance in complex NLP tasks.

Classically, the deep learning model architectures follow an architectural pattern made of three main components. First, the input is fed into an embedding layer that typically uses word embeddings such as Word2Vec, FastText, or GloVe [18]. Next, deep learning architectures including CNNs, LSTMs, and GRUs are used in the subsequent layers to capture patterns and dependencies; finally, a Softmax layer is used to produce a probability distribution of the output.

The rise of pre-trained language models like BERT (Bidirectional Encoder Representations from Transformers) represents a significant shift in the field of hate and offensive language detection. This shift is evident in its substantial presence, accounting for 38% of deep learning models used in recent years within the research community [15]. This underscores BERT’s important role as a state-of-the-art method in the field. Many case studies have consistently reported that BERT excels in capturing the nuances of offensive language and hate speech, highlighting its superiority over other traditional deep learning architectures, while requiring only small labeled datasets [18,19]. We should also note that many BERT models, such as Multilingual BERT (mBERT), have been pre-trained on large and varied multilingual datasets, which helps in addressing the complexities inherent in hate speech and offensive language detection in cross-linguistic contexts [20].

The architectural pattern in solutions that capitalize on BERT is to add task-specific classifiers to fine-tune the model for offensive language detection. Typically, the final hidden state corresponding to the [CLS] token is fed into a dense layer followed by a Softmax layer to predict the offensive content [21]. This approach leverages the contextual embeddings produced by BERT to improve accuracy over traditional methods and also reduces the need for large labeled datasets for the downstream tasks.
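The classification head described above can be sketched in a few lines of plain Python. This is only an illustration of the dense-plus-Softmax step, not the implementation used in any cited work: the [CLS] vector is replaced by random numbers, the hidden size is reduced from the 768 dimensions of BERT-base to 8 for readability, and the weights are untrained. The names `dense`, `softmax`, and `HIDDEN_SIZE` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """Fully connected layer: one output score per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


random.seed(0)
HIDDEN_SIZE = 8  # assumption for readability; BERT-base uses 768
# Stand-in for the final hidden state of the [CLS] token.
cls_vector = [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
# Untrained weights for two classes: non-offensive and offensive.
weights = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)] for _ in range(2)]
bias = [0.0, 0.0]

probabilities = softmax(dense(cls_vector, weights, bias))
print(probabilities)
```

During fine-tuning, the dense weights and the encoder parameters are updated jointly, which is why a comparatively small labeled dataset can suffice.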

In the realm of Arabic text analysis, the application of deep learning techniques faces unique challenges. There are many Arabic dialects and variations which hinder the capacity of models trained on standard Arabic to generalize well [22]. Hence, these models usually need to be fine-tuned using dialect-specific datasets and adaptation approaches. Additionally, the scarcity of large, high-quality labeled datasets in Arabic limits the effectiveness of model pre-training and evaluation. Despite these challenges, recent studies have shown promising results with adaptations of deep learning architectures and pre-trained Arabic BERT models, indicating a growing interest and effort in addressing these linguistic complexities for more robust offensive and hate speech detection solutions in Arabic [23,24,25,26]. In these studies, since there exists no standard Arabic benchmark for offensive language detection, researchers gather their own data to fine-tune their language model. All datasets were gathered from various social media platforms, including Twitter, Facebook, and YouTube. Consequently, the findings may not accurately reflect a language model’s true performance in detecting offensive language, making it difficult to compare different models. Also, data collection and annotation processes can differ from one research study to another. Having a standard corpus will allow for developing more advanced classification systems.

When it comes to the domain of low-resource languages, several researchers have recently tried to address the gap and have shown interest in different languages: Urdu [27], Greek [28], Nepali [29], and Hindi [30]. This became possible by leveraging pre-trained language models or multilingual language models (e.g., mBERT, XLM-R), which are pre-trained on high-resource languages but fine-tuned on low-resource languages with small datasets. Cross-lingual embeddings allow for the transfer of knowledge from resource-rich to resource-poor languages. However, without proper fine-tuning, reliance on pre-trained models can introduce biases from the source data. Overall, more annotated open datasets are needed for diverse low-resource languages, and efforts should focus on creating balanced datasets that include various forms of hate speech and offensive language.

Finally, it is important to note that while the academic models show high accuracy in controlled environments, their practical applications in real-world social network platforms remain limited. There is a need for integrating these models into production pipelines for real-time detection and moderation.

2.2. Testing Machine Learning Models

Testing is a crucial and necessary step in any software development, but it is far more challenging in ML-based software systems due to the inherent uncertainties stemming from their reliance on data, which makes them fail silently. Several studies highlighted the fact that even advanced NLP models can be vulnerable to adversarial attacks [31]. In the context of machine learning, where the input space can be vast and complex, ensuring that the system’s predictions are robust and reliable across various scenarios is crucial. Metamorphic testing is often used in machine learning because it focuses on assessing the model’s behavior in relation to changes in the input data [6]. It ensures the consistency and reliability of machine learning models across various transformations. The basic idea behind this approach is to identify relationships between input and output transformations that should hold true if the system is working correctly, without the need for knowing the exact value of these outputs. Therefore, this approach addresses the challenges associated with the languages’ complex and high-dimensional inputs, and the lack of clear expected outputs [32].

Metamorphic testing has been used to assess different types of properties, such as robustness or fairness. The attacks are generally black-box or white-box with (sometimes) a known target output [6]. In black-box testing, input is provided to the system and output is recorded without a need for accessing the internal workings of the system. This method allows for testing any model without the need to know its structure or implementation. In contrast, the white-box method requires knowledge of the model’s internal structure and implementation, and the created test cases may only be applicable to that specific model [33]. Black-box testing approaches commonly make use of adversarial data to test ML models. Typically, adversarial data are obtained by applying subtle modifications to the input text, small things that normally should not change the outcome for a modified sentence. The objective of testing is to attack the model using the adversarial data and identify the attacks’ success rates. The basic idea behind this approach is to identify how the outputs change according to the inputs’ transformations, without the need for knowing the exact values of these outputs. Therefore, this approach addresses the challenges associated with NLP systems’ complex and high-dimensional inputs, as well as the lack of clear expected outputs.
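To make the idea concrete, the following sketch checks an invariance relation in a black-box manner: a label-preserving perturbation should not change the prediction. Everything here is hypothetical and chosen for illustration only, including the keyword-based `toy_classifier`, the placeholder token "badword", and the definition of the rate as the share of inputs whose prediction changes after the transformation.

```python
from typing import Callable, Sequence


def toy_classifier(text: str) -> int:
    """Hypothetical stand-in for a model: 1 = offensive, 0 = not offensive."""
    return int("badword" in text.lower().split())


def insert_dot(text: str) -> str:
    """Perturbation that should preserve the label: add a dot inside longer words."""
    return " ".join(w[:2] + "." + w[2:] if len(w) > 3 else w for w in text.split())


def violation_rate(model: Callable[[str], int],
                   transform: Callable[[str], str],
                   inputs: Sequence[str]) -> float:
    """Share of inputs for which the invariance relation is violated."""
    violations = sum(1 for x in inputs if model(x) != model(transform(x)))
    return violations / len(inputs)


samples = ["you are a badword", "have a nice day", "badword"]
rate = violation_rate(toy_classifier, insert_dot, samples)
print(f"violation rate: {rate:.2f}")
```

Note that the check never needs the true label of a sentence: it only compares the model's output before and after the transformation, which is what makes the approach usable when expected outputs are unavailable.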

Several strategies were considered in the literature to generate adversarial data. Typically, attacks are categorized into three groups: character-level, word-level, and sentence-level. For example, Hosseini et al. [34] propose attacks based on modifications that consist of inserting dots or spaces between letters in a given word, or injecting typos such as by repeating or swapping two letters [35]. Introducing noise that consists of swapping two consecutive letters and keyboard typo errors has also proven to be an effective strategy for inducing failures in machine translation systems, although the adversarial data can still be comprehensible to humans [36]. Some authors also investigated the impact of punctuation insertion on models’ robustness [37]. The results demonstrate that punctuation insertions, when limited to symbols such as apostrophes and hyphens, are more effective attacks than character insertions.
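The character-level perturbations mentioned above are simple string operations. The sketch below shows one possible way to write three of them; the function names are hypothetical and the code is not taken from any of the cited works.

```python
def insert_separator(word: str, sep: str = ".") -> str:
    """Insert a separator (a dot or a space) between all letters of a word."""
    return sep.join(word)


def swap_adjacent(word: str, i: int) -> str:
    """Swap the letters at positions i and i + 1; return the word unchanged if out of range."""
    if i < 0 or i + 1 >= len(word):
        return word
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def repeat_letter(word: str, i: int, times: int = 2) -> str:
    """Repeat the letter at position i so that it appears `times` times in a row."""
    if not 0 <= i < len(word):
        return word
    return word[:i] + word[i] * times + word[i + 1:]


if __name__ == "__main__":
    print(insert_separator("salam"))        # s.a.l.a.m
    print(insert_separator("salam", " "))   # s a l a m
    print(swap_adjacent("salam", 1))        # slaam
    print(repeat_letter("salam", 1))        # saalam
```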

Alternatively, some researchers prioritize generating adversarial data through word substitutions, modifying texts by replacing certain words. While research on generating adversarial text data is progressing, crafting effective adversarial samples in NLP differs from computer vision because text inputs cannot be modified arbitrarily like image data [38]. Recent advancements in adversarial text generation focus on preserving semantic similarity by replacing words with synonyms or paraphrasing sentences, ensuring that adversarial examples retain the meaning and the syntactic coherence of the original inputs while deceiving models. Replacement techniques, such as substituting words with their synonyms, have been effectively adopted in generating adversarial data for applications such as sentiment analysis [39] and natural language inference [40]. Techniques such as population-based optimization algorithms were also used to select adversarial replacements based on their semantic and syntactic similarity to the original text, enhancing the attack’s effectiveness without sacrificing linguistic quality [38]. Moreover, transformations based on ‘Semantically Equivalent Adversaries’ consist of changing original words in the sentence through semantics-preserving perturbations. Besides the use of synonyms, some studies considered transformations such as replacing nouns with pronouns (e.g., “the cat” becomes “it”) or substituting contractions (e.g., “what is” becomes “what’s”). As highlighted in Ribeiro et al. [41], this approach is effective in detecting bugs in three domains, namely sentiment analysis, machine comprehension, and visual question answering. Another category of attacks considered in the literature is ‘visual attacks’, which involve replacing characters or words in the input with visually similar counterparts in an embedding space based on spelling similarity [42]. For example, the word ‘best’ might be substituted with ‘rest’ [43]. Other word transformation methods experimented with in the literature consist of generating adversarial inputs by identifying important words in a text and deleting or replacing those words, as well as by injecting new words that were not part of the original text. The latter approach was applied to a sentiment analysis task and a gender detection task, proving its effectiveness [35].
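A word-substitution attack can be sketched as the enumeration of candidate sentences that differ from the original by a single replaced word. The synonym table below is a hypothetical toy resource invented for this example; the cited works rely on lexical databases or embedding neighbours, and they additionally filter candidates for semantic and syntactic similarity, which this sketch does not do.

```python
from typing import Dict, List

# Hypothetical toy synonym table, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "movie": ["film"],
    "great": ["excellent", "superb"],
}


def single_word_substitutions(sentence: str,
                              synonyms: Dict[str, List[str]]) -> List[str]:
    """Return every variant of `sentence` obtained by replacing exactly one word."""
    words = sentence.split()
    candidates = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            candidates.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return candidates


if __name__ == "__main__":
    for candidate in single_word_substitutions("the great movie", SYNONYMS):
        print(candidate)
```

Each candidate would then be submitted to the model under test, and the attack counts as successful when the prediction changes although the meaning is preserved.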

Besides character- and word-level substitutions, some authors considered subword sememe-based substitutions, where a sememe is the smallest unit of meaning in a language [44]. This method allows for crafting high-quality adversarial examples while retaining grammaticality and naturalness, leading to a higher attack success rate and outperforming synonym-based substitution methods across multiple datasets and models.

For sentence-level replacement, the work of Tu et al. [45] identifies several categories of metamorphic rules to test four question-answering (QA) systems. These transformations include rephrasing comparative questions, replacing comparative words with their antonyms, or changing the subjects of questions, removing some sentences or reversing their order, and adding or removing irrelevant sentences. A similar idea was explored by Iyyer et al. [46], who suggested using syntactically controlled paraphrasing networks (SCPNs) to generate adversarial examples.

There are several methods used in the literature to select words that need to be replaced. One approach targets the most important words which have a high likelihood to influence the predictions of the model [31]. Another method, proposed by Ren et al. [47], is called ‘Probability Weighted Word Saliency’, where saliency refers to the degree of change in the output probability of the classifier if the word is replaced with “unknown”. Other researchers opted for scoring the words according to their importance, with the most crucial word exerting the greatest impact on the label in the original text [43]. The importance of words is calculated by evaluating the predictions using three methods, namely by considering the sentence head up to the word of interest, the sentence tail after the word, or the sentence being stripped from the word.
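The last variant above, in which the sentence is stripped of the word of interest, can be sketched as a leave-one-out score: the importance of a word is the drop in the predicted probability when that word is removed. The scoring function and its lexicon are hypothetical placeholders; with a real model, `score_fn` would return the classifier's probability for the offensive class. Replacing the word with an unknown token instead of deleting it would give a saliency closer in spirit to the one used by Ren et al. [47].

```python
from typing import Callable, List, Tuple


def toy_offensive_probability(text: str) -> float:
    """Hypothetical scoring function standing in for a classifier's probability."""
    weights = {"badword": 0.6, "rude": 0.3}  # placeholder lexicon
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return min(score, 1.0)


def word_importance(text: str,
                    score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Rank words by how much the score drops when each one is removed."""
    words = text.split()
    base = score_fn(text)
    ranked = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        ranked.append((word, base - score_fn(reduced)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for word, drop in word_importance("you rude badword", toy_offensive_probability):
        print(f"{word}: {drop:.2f}")
```

An attack would then perturb the highest-ranked words first, since they are the most likely to flip the prediction.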

Metamorphic testing with adversarial data has been applied in a wide spectrum of applications. For example, it has been used to assess relationships in natural language understanding models [48], evaluate robustness in toxicity detection systems [34], examine fairness in language translation models [49], test performance in question-answering systems [45], evaluate entity recognition models [50], and assess relation extraction models [51]. Moreover, metamorphic testing can also help with evaluating the quality of data, as incorrectly generated data, for example data containing translation errors, can lead to poor models and potential prediction errors [52]. Furthermore, Ren et al. [47] examined the problem of adversarial attacks on text classification, which has received less attention compared to image classification. Overall, the case studies reported that metamorphic testing using adversarial data uncovered several vulnerabilities of the models under test, as it can comprehensively test a system’s behavior by generating inputs that are likely to uncover issues [32]. However, even if a given model satisfies all metamorphic relations, this does not guarantee the absence of bugs in ML systems. In a sentiment analysis case study [53], the authors highlight the concept of ‘false satisfactions’ and show that 20–50% of the bugs may remain undetectable through the metamorphic testing approach.

Baseline results on robustness testing for sentiment analysis models, particularly in offensive language detection, reveal critical vulnerabilities when models are attacked with adversarial data. For instance, Wang et al. observed that models such as BERT and LSTMs encountered attack success rates of 54–58% in black-box testing with word perturbations [54]. Tsai et al. reported failure rates of 65–72% for CNN-based sentiment classification methods subjected to word-level edits [55]. Furthermore, Rusert et al. showed a 67% accuracy drop for state-of-the-art offensive language classifiers under word substitution attacks involving context-aware embeddings, indicating a significant performance drop while preserving the original meaning of the text [56]. Similarly, Ribeiro et al. showed that NLP models used in commercial production systems by large software companies are also vulnerable to critical bugs when subjected to adversarial attacks. By adopting principles from behavioral testing in software engineering, some attacks achieved success rates exceeding 90% [57]. In Arabic NLP, transformer-based models for detecting offensive language were found to be vulnerable to adversarial attacks, with success rates of up to 30% obtained by substituting a single word while preserving the semantic and syntactic structure [58].

Research in fairness testing for NLP models reveals significant findings regarding adversarial examples and biases. One study employing the CheckList tool indicates that models like BERT demonstrate high error rates when classifying abusive language related to minorities, with failure rates exceeding 90% for protected attributes such as race and sexual orientation [59]. For example, failure rates for perturbing race, nationality, and religion were 94%, 33.2%, and 90.8%, respectively. Another research study that evaluated several sentiment analysis production systems by big companies also identified model vulnerabilities to adversarial perturbations for fairness testing [57]. For example, issues like gender stereotypes and neutral statements related to feminism still showed high failure rates, with errors of up to 76.5%. These results emphasize that while models achieve high accuracy, they fail significantly under fairness-sensitive tests, with adversarial datasets revealing biases in implicit stereotypes.
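A fairness test of this kind can be sketched as an invariance check over a template: the same sentence is instantiated with different values of a protected attribute, and the prediction is expected to stay the same. The group names below are neutral placeholders, and `biased_toy_model` is a hypothetical model built to fail the check; neither comes from the cited studies.

```python
from typing import Callable, Dict, List


def biased_toy_model(text: str) -> int:
    """Hypothetical model that wrongly flags any mention of one group (1 = offensive)."""
    return int("group_b" in text)


def fairness_failures(model: Callable[[str], int],
                      template: str,
                      groups: List[str]) -> List[str]:
    """Return the groups whose prediction differs from that of the first group."""
    predictions: Dict[str, int] = {g: model(template.format(group=g)) for g in groups}
    reference = predictions[groups[0]]
    return [g for g in groups if predictions[g] != reference]


if __name__ == "__main__":
    groups = ["group_a", "group_b", "group_c"]
    failing = fairness_failures(biased_toy_model, "my neighbour is {group}", groups)
    print(failing)
    print(f"failure rate: {len(failing) / len(groups):.2f}")
```

Because the template is a neutral statement, any change in the prediction can only be attributed to the substituted attribute, which is what the reported failure rates measure.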

While significant advancements have been made in offensive language detection using deep learning models, especially in resource-rich languages such as English and German, the research remains limited for low-resource languages such as Moroccan Darija. Despite the success of pre-trained models like BERT in multilingual contexts, existing works often overlook dialect-specific challenges, resulting in a lack of robustness and generalizability in under-resourced languages. Additionally, while adversarial testing has been widely used in NLP, its application for robustness testing in low-resource languages, particularly for offensive language detection, is under-explored. Our work addresses these gaps by creating a human-labeled dataset for MD, fine-tuning a BERT-based model, and employing metamorphic testing with adversarial data to rigorously assess the model’s properties. This study contributes novel insights into improving language models for low-resource, dialect-rich languages where data scarcity and linguistic complexity pose significant challenges.

3. Materials and Methods

In this section, we present the Darija offensive dataset collection and preparation, and describe the deep learning model architectures and experiments. This also includes the testing process of the offensive language detection model, adversarial data generation, and the definition of metamorphic relations. The workflow consists of several steps, as shown in Figure 1. In the data collection phase, we gather Moroccan Darija text from different online platforms. In the data annotation phase, we curate and label the collected data for offensive language detection. The next step is model selection and fine-tuning, where we experiment with various language models, including three Darija-specific language model variants and different types of classifiers. Then, we generate adversarial examples using well-known NLP attacks to test the victim model against potential attacks. We also design metamorphic rules that stipulate how outputs should change in response to specific, systematic changes in the input. We evaluate the model using the adversarial attack success rate metric. Finally, we perform a manual error analysis to understand what makes our model vulnerable, and we make recommendations for further refinement and improvement.

3.1. Darija Offensive Language Dataset Collection and Preprocessing

The curated dataset comprises 20,402 Moroccan Darija sentences sourced from diverse outlets, annotated as offensive or inoffensive. The dataset structure consists of two columns: one housing the sentence itself and the other indicating the corresponding label, as elucidated in Table 1. Specifically, the dataset comprises 7717 sentences marked as offensive and 12,685 sentences categorized as non-offensive, resulting in a distribution where 37.8% of sentences are deemed offensive, while the remaining 62.2% are not.

Understanding the distribution of sentence lengths plays a crucial role in choosing the right techniques and algorithms to use. Sentence lengths range from 1 to 150 words, and the number of sentences sharing the same word count ranges from 1 to 1707. The sentence length distribution, shown in Figure 2, is right-skewed, with a longer tail on the right-hand side. The median sentence length is nine words, while the mode is four words. Notably, the 5th and 95th percentiles of the distribution show that approximately 90% of sentence lengths fall between 2 and 34 words.
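The summary statistics quoted above can be computed with the Python standard library. The sketch below assumes that sentence length is the number of whitespace-separated tokens, which may differ from the tokenization actually used; the function name `length_summary` and the toy sentences are hypothetical and do not come from the dataset.

```python
import statistics
from typing import Dict, List


def length_summary(sentences: List[str]) -> Dict[str, float]:
    """Median, mode, and 5th/95th percentiles of sentence lengths in words."""
    lengths = sorted(len(s.split()) for s in sentences)
    # With n=20 the cut points fall at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(lengths, n=20, method="inclusive")
    return {
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        "p5": cuts[0],
        "p95": cuts[-1],
    }


if __name__ == "__main__":
    toy_sentences = [
        "a b c d",
        "a b c d",
        "a",
        "a b c d e f g h i",
    ]
    print(length_summary(toy_sentences))
```

Roughly 90% of the lengths lie between the two percentiles, which is a useful guide when choosing a maximum sequence length for a model.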

Short sentences of four words are very common since social media content is the predominant form of content within the dataset. Social media encourages brevity in writing, so comments on these platforms tend to take a more informal ‘reaction’ or ‘response’ form rather than full sentence constructions. Such sentences do not normally need much expansion. It is also worth remarking that this dataset contains many insults and expressions of frustration. In sentences like these, a lot can often be said with very few words.

Moroccan online users seamlessly switch between the Latin and Arabic scripts to express their thoughts. This linguistic phenomenon is faithfully reflected in our dataset, which encapsulates a blend of sentences in both scripts. Arabic-script sentences predominate with a percentage of 74% (14,978 sentences), while 25% of the sentences (5170 sentences) are presented in the Latin script. Additionally, a subset of 1% of the sentences (254 sentences) showcases a combination of both scripts.
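A breakdown by script such as the one above can be approximated by inspecting the Unicode range of each character. The sketch below is an assumption about how such a count could be produced, not the procedure used for the dataset: it treats the basic Arabic block (U+0600 to U+06FF) as Arabic script and unaccented ASCII letters as Latin script, and it ignores digits, so Arabizi numerals alone do not make a sentence Latin.

```python
def detect_script(text: str) -> str:
    """Classify a sentence as 'arabic', 'latin', 'mixed', or 'other'."""
    has_arabic = any("\u0600" <= ch <= "\u06FF" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_arabic and has_latin:
        return "mixed"
    if has_arabic:
        return "arabic"
    if has_latin:
        return "latin"
    return "other"


if __name__ == "__main__":
    for sample in ["salam", "سلام", "salam سلام", "123"]:
        print(repr(sample), "->", detect_script(sample))
```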

The initial step involved gathering data, with sentences being sourced from two primary platforms: Twitter and YouTube. Twitter, being a globally used platform, served as a valuable source of textual content in the form of tweets. We employed the Twitter API and retrieved data by using a set of commonly used Moroccan Darija words. These words served as filters to define the scope of the tweets to be collected. Among the utilized keywords were terms like “chkoun” (who), “kifach” (how), “sir” (go), etc. Once the initial batch of tweets was obtained, we leveraged the originating accounts to acquire further tweets from the same accounts. This approach was based on the assumption that accounts that tweeted at least once in Moroccan Darija are more likely to produce additional content in the same language.

In parallel, the project’s scope encompassed the extraction of comments from 40 renowned Moroccan channels, most notably “ChoufTV”. These channels boasted a substantial volume of comments in Darija. Given that YouTube employs JavaScript rendering, our methodology involved employing Selenium, which operates by simulating user actions, such as scrolling and clicking, enabling the retrieval of data despite the platform’s intricate rendering mechanism. Table 2 presents a list of data samples, their English translations, and corresponding labels.

The data preprocessing phase involved several key steps. We lowercased the text to ensure consistent letter casing. We replaced emojis with their textual descriptions to retain semantic information such as emotions or symbols, using the Arabic Emojipedia repository (https://github.com/a-ibrahimi/Arabic-Emojipedia (accessed on 4 January 2024)) for this purpose. We removed a set of special characters that do not contribute to the sentence’s meaning; notable examples are punctuation marks and characters such as the asterisk ‘*’, hyphen ‘-’, and at sign ‘@’. We also handled elongated characters, which are commonly used in social media. For instance, users may write “salaaam” instead of “salam” (hello in Moroccan Darija). As no specialized tools were available for detecting Moroccan Darija words with elongated characters, we opted for a straightforward approach: identifying words containing more than two consecutive instances of the same character and replacing them with just two instances of that character. Finally, we normalized the Arabic text, bringing it to a unified form by removing diacritics from all letters and by removing elongations.
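The steps above can be sketched as a small text-cleaning function. This is a minimal illustration under stated assumptions: the set of removed characters is taken to be ASCII punctuation, the diacritics range U+064B to U+0652 and the tatweel character U+0640 are assumed to be what "diacritics" and "elongations" refer to, and the emoji replacement step is omitted because it depends on the mapping in the repository cited above.

```python
import re
import string

# Assumption: "special characters" are approximated by ASCII punctuation.
SPECIAL_CHARS = re.compile("[" + re.escape(string.punctuation) + "]")
# Three or more repetitions of the same character.
ELONGATION = re.compile(r"(.)\1{2,}")
# Assumption: Arabic diacritics (tashkeel) in U+064B to U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"  # Arabic elongation character


def preprocess(text):
    """Lowercase, strip special characters, and normalize elongations."""
    text = text.lower()
    text = SPECIAL_CHARS.sub(" ", text)
    # Keep exactly two copies of any character repeated more than twice.
    text = ELONGATION.sub(r"\1\1", text)
    text = DIACRITICS.sub("", text).replace(TATWEEL, "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(preprocess("Salaaam!!"))
```

Note that, following the rule described above, “salaaam” is reduced to “salaam” (two instances of the repeated letter) rather than to “salam”.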

In order to annotate the data in a consistent way, we established a list of traits and guidelines that make a sentence offensive, which can be summarized as follows: (i) vulgar language, such as the use of swear words, profanities and inappropriate references to private parts or sexual acts; (ii) name calling, such as the use of abusive names to belittle or humiliate another person; (iii) hate speech, such as attacking religious beliefs and the use of homophobic and racist language; and (iv) derogatory language, such as the use of pejorative terms relating to illness or disabilities, besides sexism, evil wishes, direct threats and family insults.

The labeling process involved two human annotators who followed the predefined guidelines to determine whether a sentence was offensive or not. The selected annotators are native speakers with demonstrated expertise in the sentiment analysis task. Training sessions were held in the form of meetings to agree on the guidelines to be followed. The inter-annotator agreement rate for this study was 88.79%. For data samples with conflicting labels, a third annotator, who went through the same pre-screening and training process as the first two annotators, was consulted for resolution.
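For illustration, an agreement rate and the third-annotator resolution can be computed as follows. The text does not specify how the agreement rate was calculated; this sketch assumes raw percentage agreement, and the function names and toy labels are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def resolve_labels(labels_a, labels_b, labels_c):
    """Keep agreed labels; use the third annotator only on conflicts."""
    return [a if a == b else c for a, b, c in zip(labels_a, labels_b, labels_c)]


# Toy labels (hypothetical): 1 = offensive, 0 = not offensive.
annotator_1 = [1, 0, 1, 1]
annotator_2 = [1, 0, 0, 1]
annotator_3 = [1, 1, 1, 0]
print(percent_agreement(annotator_1, annotator_2))
print(resolve_labels(annotator_1, annotator_2, annotator_3))
```

Raw agreement does not correct for chance agreement; a chance-corrected coefficient would be a natural complement, but none is reported here.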

Despite the important resources invested in collecting, preprocessing, and labeling the dataset, we still encountered a handful of limitations. Namely, there was a lack of diversity in types of comments, including offensive text, due to the fact that only two sources were used for data collection: Twitter and YouTube. Indeed, the comments and tweets retrieved only represent a subset of Moroccan online users, encompassing relatively biased content. This limitation also relates to the nature of the data sources used. The prevalence of short sentences in social media content, which is clearly manifested in the dataset shown in Figure 2, limits contextual depth. Moreover, only about 37% of the sentences are labeled as offensive, meaning that the dataset suffers from class imbalance. Having a balanced dataset proves crucial for several classification techniques and algorithms. Therefore, it is necessary to address class imbalance to improve model performance and ensure accurate classification results.

3.2. Deep Learning Models for Offensive Language Detection

We fine-tuned five language models on the offensive language detection task. We used two Arabic language models, namely MARBERT and ARBERT [60], besides three Darija language models, including DarijaBERT [61] and two other compact models, DarELECTRA and DarRoBERTa [62]. We opted for experimenting with three different classifier models: CNNs, LSTMs and a combined CNN-LSTM architecture. In all experiments, we generated the pooled embedding vector from the sequence embedding vectors using a dense layer with a tanh activation function.

The CNN architecture we opted for consists of a one-dimensional convolution layer with L1 regularization used for feature extraction, followed by a max pooling layer for feature condensation, and a dropout layer with a 0.1 dropout rate. Next, we flatten the feature map, and we use a dense layer with L2 regularization, followed by the output layer using a sigmoid activation function that classifies the comments as offensive or not offensive.

For the second experiment, the transformers’ pooled output was fed to an LSTM layer with 128 hidden units and L1 regularization. A dropout layer with a 0.1 dropout rate was then applied, followed by a dense layer with L2 regularization, and finally the output layer for classification.

The third experiment consists of a combination of a CNN and LSTM that was proven to be effective for a variety of tasks. This combination allows CNN’s filters to extract n-gram features that are passed to the LSTM layer to discover feature dependencies. As a result, word-level features are extracted first by the CNN, and sentence-level features are highlighted by the LSTM. Our model consists of a one-dimensional convolution layer with L1 regularization, followed by a max pooling layer for feature condensation, and a dropout layer with a 0.1 dropout rate. The output is flattened and then passed to a LSTM layer with 128 hidden units and L1 regularization, followed by a dropout layer with 0.2 dropout rate. Finally, we use a dense layer with L2 regularization followed by the output layer for classification.

When the different classes are not present in the training data in equal proportions, the more samples of a class we have, the more that class will determine gradient updates. To overcome the unbalanced data issue, we use class weighting, i.e., while computing the total loss, a weighted sum is performed instead of a regular sum, in which each sample’s loss is weighted according to its class. Each class is assigned a weight that is inversely proportional to its frequency in training data. This results in samples from less frequent classes having a larger contribution to total loss and thus to gradient updates. The model is therefore protected from over-generating labels in the majority class. Accordingly, the weight of the offensive class was set to 1.321, while the weight of the non-offensive class was set to 0.8042. In our work, the offensive class has a higher weight since we have fewer offensive labels than non-offensive ones.
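The weighting scheme can be sketched as below. One common inverse-frequency formulation is w_c = N / (K · n_c), where N is the number of training samples, K the number of classes, and n_c the number of samples in class c. The reported weights (1.321 and 0.8042) are numerically consistent with this formula for a training split that is roughly 38% offensive, but the exact formula used is an assumption here, as are the function names and the toy label counts.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed formulation: weight_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_bce(y_true, y_prob, weights):
    """Mean binary cross-entropy with each sample scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * sample_loss
    return total / len(y_true)


# Hypothetical label list: 1 = offensive (minority), 0 = non-offensive.
toy_labels = [1] * 378 + [0] * 622
toy_weights = inverse_frequency_weights(toy_labels)
print(toy_weights)
```

With these weights, a misclassified offensive sample contributes more to the loss than a misclassified non-offensive one, which counteracts the pull of the majority class on gradient updates.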

3.3. Model Testing Beyond Accuracy

In this work, we draw inspiration from software engineering testing to evaluate the best performing offensive language detection model from previous experiments in terms of correctness, robustness and fairness. This is carried out by applying metamorphic testing in a black-box setting and combining character-level and word-level modifications.

To generate test cases that challenge the model, we proceed in two main steps. First, we identify the most important words in a sentence using the algorithm in Figure 3. Then, we modify the identified tokens and re-test the model with the new inputs. The goal of the first step is to identify the two most important words, based on their ability to significantly reduce the model’s confidence or change its prediction completely [35,63]. Only two words are selected for perturbation, as making more alterations could render the sentence incomprehensible [40,47]. The importance of each word is determined by combining its head score and its tail score. The head score is calculated by subtracting the model’s prediction score for the sentence up until the word from the model’s prediction score for the same sentence without the word. The tail score is calculated similarly, by subtracting the model’s prediction score for the sentence starting with the word and continuing to the end from the model’s prediction score for the same sentence without the word. Together, these scores measure how much removing the word changes the model’s prediction and, by extension, how much the word contributes to the overall meaning of the sentence: the head score represents the contribution of the word to the beginning of the sentence, while the tail score represents its contribution to the end of the sentence. Because subtracting predictions may yield negative values, both scores are converted to their absolute values. The combined score, obtained by adding the absolute head and tail scores, serves as the importance metric, and the word with the highest combined score is deemed the most important one, as its presence induces the most significant change.
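The scoring procedure described above can be sketched as follows. This is not a reproduction of the algorithm in Figure 3: it assumes that "the same sentence without the word" means the same prefix (for the head score) or suffix (for the tail score) with the word removed, and it uses a hypothetical `toy_score` function in place of the real classifier, which is queried as a black box.

```python
def word_importance(words, score_fn):
    """Combined absolute head and tail score for every word.

    score_fn maps a list of words to a prediction score; it stands in
    for the black-box model.
    """
    scores = []
    for i in range(len(words)):
        # Head: prefix ending at the word vs. the prefix without it.
        head = abs(score_fn(words[: i + 1]) - score_fn(words[:i]))
        # Tail: suffix starting at the word vs. the suffix without it.
        tail = abs(score_fn(words[i:]) - score_fn(words[i + 1:]))
        scores.append(head + tail)
    return scores


def top_k_words(words, score_fn, k=2):
    """Return the k words with the highest combined score."""
    scores = word_importance(words, score_fn)
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    return [words[i] for i in ranked[:k]]


def toy_score(words):
    """Hypothetical stand-in model: 1.0 if a flagged word is present."""
    return 1.0 if "badword" in words else 0.0


sentence = ["you", "are", "badword", "indeed"]
print(word_importance(sentence, toy_score))
print(top_k_words(sentence, toy_score))
```

Each word requires four model queries in this form, so the cost grows linearly with sentence length, which is manageable for the short sentences that dominate the dataset.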

The main idea of the second step is to provide specified and personalized metamorphic rules (MRs) that not only describe the perturbations performed on the source data but also the results against which we evaluate the output of our test data [64]. In this work, we define eight MRs to test the three properties of the model. These MRs take into consideration the usual real-world errors that can be made by humans while typing, as well as purposeful attacks performed against the model. The assumption from the designed MRs for the model to be considered error-free is that its predictions should remain invariant under any type of induced perturbation.

To assess the correctness of the system, we test if it performs well when provided with correct clean data. The modifications that were made preserve the correctness and meaning of the original sentences. The different MRs that were developed for testing correctness are described in Table 3.

To test the robustness of the model, we provide it with slightly altered data, introducing modifications that make the sentences incorrect but still understandable for humans. The model should see through these alterations and comprehend the sentences. To ensure effective adversarial data generation while preserving semantic similarity, we considered several controls. We specifically limit the number of edits, such as changing a letter in a word or substituting it with synonyms, to a maximum of two. This approach ensures that the meaning of the sentence is not drastically altered while still introducing sufficient perturbations to confuse the target model. If more than two words are modified, the meaning of the sentence might drift, and inaccuracies in the test results may arise due to inevitable errors by the model. In situations where we swap or change only one letter, we perturb two words. However, for more significant perturbations such as replacing a word with “unknown”, we only modify a single word to avoid changing the meaning of the sentence. Moreover, the generated adversarial examples are constrained within generic semantic spaces, such as typo spaces and contextualized semantic spaces using the embedding space. By leveraging these spaces, the adversarial texts remain semantically close to the original inputs, helping to preserve meaning while testing the model’s robustness. Due to the lack of robust semantic similarity models for Darija that could automate the process, we assess the readability and semantic preservation of the adversarial texts through a manual evaluation. An annotator manually reviewed modified sentences to determine if the texts remained intelligible and similar in meaning to the original ones. The different MRs that were developed for testing the model’s robustness are described in Table 4.

When it comes to fairness, the objective is to evaluate whether the model performs well when the attributes of the sentences are changed to sensitive ones that are usually subject to bias and discrimination, such as ethnicity, religion, nationality, and gender. We use the same function as in the correctness test to change the named entities, except that we replace them with sensitive names this time. The MRs that were developed for testing the model’s fairness are described in Table 5.

To produce adversarial data, we perturbed the original test set. For each attack, the adversarial dataset contains 4086 instances, except for modifications that involve a single character perturbation. In those cases, we combined the transformations equally to create a test set of 4086 instances. This test set includes the following attacks: deleting one letter, inserting one letter, swapping two letters, and replacing one letter with another one. For attacks that involve substituting similar words or altering named entities, we generated test sets with only 1000 instances from the original test set for each type. This choice stems from the requirement that comments must include named entities and must be in Arabic script. In terms of labeling, the ground truth of the adversarial datasets against which the model’s predictions are evaluated is the same as in the original test set.
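The four single-character attacks listed above can be illustrated with the following sketch. The function names, the Latin alphabet used for insertions and replacements, and the random choice of position are assumptions for this example; the study's own transformation code is not shown in the text.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: Latin-script input


def delete_char(word, rng):
    """Delete one randomly chosen letter."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert_char(word, rng, alphabet=ALPHABET):
    """Insert one random letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(alphabet) + word[i:]


def swap_chars(word, rng):
    """Swap two adjacent letters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def replace_char(word, rng, alphabet=ALPHABET):
    """Replace one letter with a different one."""
    if not word:
        return word
    i = rng.randrange(len(word))
    candidates = [c for c in alphabet if c != word[i]]
    return word[:i] + rng.choice(candidates) + word[i + 1:]


rng = random.Random(0)  # fixed seed so that the perturbations are repeatable
for attack in (delete_char, insert_char, swap_chars, replace_char):
    print(attack.__name__, attack("kifach", rng))
```

Swapping two identical adjacent letters leaves the word unchanged, so a test generator may want to reject perturbations that are equal to the original word.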

4. Results and Discussion

To evaluate the model’s performance on the test data, we used accuracy (acc), precision (prec), recall (rec), the F1 score, and the success rate. In this research, we used NVIDIA Tesla T4 Graphics Processing Units (GPUs) provided by the Google Colab Pro platform.
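For reference, the four classification metrics can be computed from the confusion counts as in the sketch below, with the offensive class treated as the positive class. The function name and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"acc": accuracy, "prec": precision, "rec": recall, "f1": f1}


# Toy predictions (hypothetical) on an imbalanced label set.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In this toy case the accuracy (0.75) is higher than the F1 score (about 0.67), which shows how accuracy can look better than F1 when the positive class is the minority.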

4.1. Deep Learning Models for Offensive Language Detection

For the offensive language detection task, we used a split of 85% for training and 15% for validation. As can be seen in Table 6, DarRoBERTa performed slightly better than the other pre-trained models, achieving an accuracy of 90% and an F1 score of 85%. Even though MARBERT and ARBERT were trained on fewer MD instances than the Darija-specific models, they achieved high scores, which confirms the findings in the literature, i.e., using a diversified dataset composed of formal and informal text in Arabic and its dialects for pre-training or fine-tuning BERT can improve the effectiveness of Arabic text classification. We also noticed that the ARBERT-based model achieved the highest recall (89%), indicating its ability to capture the highest proportion of offensive content, i.e., the true positives out of all actual offensive instances. The highest precision was achieved by the ELECTRA-based Darija language model (89%), indicating its ability to minimize false positives when flagging comments as offensive.

It should be noted that although the performances of the different pre-trained Darija models are similar, DarELECTRA (55 M parameters) and DarRoBERTa (80 M parameters) are much more compact than DarijaBERT, which is a BERT-based model (110 M parameters).

While accuracy was the highest score in all experiments, the F1 score was not as high. This indicates that all of the models were good at classifying one label but not the other. A manual error analysis shows that this phenomenon could be explained by a number of factors: (i) an inability of the models to detect sarcasm and irony in offensive tweets, (ii) the use of a negative word to express a positive meaning, and (iii) the use of some words that are inappropriate in some regions but accepted in others.

In the task of offensive language detection, recent benchmarks highlight the advanced performances of LLMs like GPT-3.5, Mistral, and Flan-T5 [65]. In terms of robustness, the findings from [65] revealed that while all three models demonstrated reliable performance across languages, their results varied depending on language structure and dataset composition. Mistral and GPT-3.5 showed strong adaptability across languages, with Mistral particularly excelling in English and Spanish. However, GPT-3.5’s superior handling of German offensive language detection suggests its robustness in languages with more complex grammatical structures. The findings also indicated that GPT-3.5’s content moderation safeguards sometimes interfered with straightforward predictions, potentially affecting its output consistency in sensitive contexts. Flan-T5, with fewer parameters, remained effective in English but was less robust in non-English settings, showing reduced precision and recall. Overall, while Mistral and GPT-3.5 exhibit higher robustness in multilingual tasks, the distinct moderation policies and smaller model capacity of Flan-T5 limit its adaptability outside of English contexts. The best results for the three languages under monolingual and multilingual settings are shown in Table 7.

In comparison to the results obtained using Mistral and GPT-3.5, our RoBERTa-based model achieved competitive performance, particularly in precision, despite its relatively compact size. The comparison to results from [65] opens up an avenue for future work to enhance our results by exploring techniques such as fine-tuning on multilingual datasets.

4.2. Testing the Properties of the Best-Performing Model

We selected the best-performing model, DarRoBERTa LM combined with a CNN-LSTM classifier, as the victim model. We evaluated the failure of the model by using the adversarial success rate. The metric is defined as the percentage of the adversarial examples that were successful in making the victim model change its classification from one class to another, compared to the predictions on the original test set. Each attack has its own success rate, as defined by (1).

$$\text{Success rate} = \frac{\text{Number of successful attacks}}{\text{Total attacks}} \tag{1}$$
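As a minimal sketch of Equation (1), the success rate can be computed by comparing the victim model's predictions on the original inputs with its predictions on the perturbed inputs. The function name and the toy predictions are assumptions; the result is expressed as a percentage to match how the rates are reported.

```python
def attack_success_rate(original_preds, adversarial_preds):
    """Percentage of adversarial examples that flip the model's prediction."""
    if len(original_preds) != len(adversarial_preds) or not original_preds:
        raise ValueError("prediction lists must be non-empty and of equal length")
    flipped = sum(o != a for o, a in zip(original_preds, adversarial_preds))
    return 100.0 * flipped / len(original_preds)


# Toy predictions (hypothetical): one of four predictions changes.
preds_before = [1, 0, 1, 0]
preds_after = [1, 1, 1, 0]
print(attack_success_rate(preds_before, preds_after))
```

Because every metamorphic relation here expects the prediction to stay invariant under perturbation, any flipped prediction counts as a successful attack regardless of the direction of the flip.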

The results of the experiments indicate that the attacks succeeded in deceiving the model, with some perturbations being more successful than others. If we consider the results of correctness testing, as shown in Table 8, we can observe a decrease in performance against specific adversarial attacks. Despite a slight maximum reduction of 2% in the F1 score, the results highlight potential vulnerabilities in the model’s correctness and generalization capabilities. The 7.9% success rate for inserting filler words suggests that the model is sensitive to extraneous information, showing a vulnerability in handling irrelevant or noisy data. The offensive language detection model was trained mainly on clean data and seems to struggle when confronted with noisier data. The 4% success rate for replacing named entities indicates a moderate vulnerability in the model’s understanding of the contextual importance of specific entities. This suggests that, while the model can often maintain performance despite these changes, it shows weaknesses in comprehending some of the named entities and their significance in identifying offensive content. The 1.1% success rate for replacing words with similar ones might indicate a low effectiveness of this approach in altering the model’s predictions, likely due to the robustness of transformer LLMs, which use byte-based tokenizers, making them less sensitive to word-level changes.

The results of robustness testing are shown in Table 9. The model’s vulnerability to minor perturbations such as inserting a dot or space between letters, or modifying one letter in a word, resulted in success rates ranging from 17.9% to 29.4%. This character-level model vulnerability, which seems to be a common problem in most NLP systems, shows that the model relies primarily on exact words. On the other hand, the 16.1% success rate for deleting spaces between words indicates that the model’s ability to parse and understand word boundaries correctly is moderately affected. Repeating vowels resulted in a 13.4% success rate, suggesting that the model might struggle with informal or colloquial language patterns, such as vowel repetition, possibly exacerbated by preprocessing steps aimed at data normalization. The 10.6% success rate for replacing a word with “unknown” suggests that BERT’s contextual understanding and attention mechanisms can mitigate the impact of replacing words with the “unknown” token. This resilience could be due to its ability to infer missing information from the surrounding context. Attacks involving changing punctuation or numbers had low success rates, ranging from 0.42% to 6.1%. However, the absence of failure in this case may be skewed by a lack of training data diversity in these aspects. A manual inspection of the original training data revealed that it did not include much punctuation or numerical information, leading the model to rely less on these features.

Overall, the robustness success rates are not negligible. This could be explained by the fact that these attacks target the most important words as determined by the important tokens algorithm, which seems to frequently select offensive words when present. Modifying these words can cause the sentence to lose its offensive meaning, leading the model to make errors in its predictions. This performance drop is particularly exhibited by an increase in the false negative rate, which is reflected by a large drop in recall that reaches 15% in certain test cases. We also noticed a significant decrease in precision when the model encountered words with inserted dots, with false positives rising sharply as precision dropped by 28%. This drop in precision may be attributed to the language model learning a pattern that associates the presence of dots with the intent to obscure offensive text. On social media, users often employ dots as a tactic to bypass content moderation systems, leading the model to interpret any text with dots as potentially offensive. This pattern recognition can result in an increased likelihood of false positives, as the model may misidentify unoffensive language containing dots as offensive. This limitation underscores the need for improved learning to differentiate between normal uses of dots and intentional obfuscation of harmful language.

To test the fairness of the model, the attacks focused on replacing named entities with sensitive names. The model is clearly affected by bias to a certain extent. The specific biases detected during our testing include nationality-related bias and gender bias. For instance, the model was more likely to classify sentences as offensive when a male name was replaced with a female name, leading to an increase in offensive classifications. Similarly, the predictions made by the model completely changed when the nationalities were replaced with sensitive names, especially considering that most of the countries and nationalities were substituted with those that have political conflicts with the country from where the dataset was generated. These patterns reflect how the model may inadvertently associate certain demographic or cultural groups with negative sentiments. The results displayed in Table 10 show a 7% success rate, which reflects the reliance of the model on specific named entities, instead of building an understanding of named entities and their impact on whether a comment is offensive or not.

Examples of adversarial data along with their classifications are shown in Table 11. Overall, the results indicate that there are some types of attacks concerning robustness and fairness, which highlight specific vulnerabilities in the model’s handling of input perturbations. Despite these vulnerabilities, it is noteworthy that the results align with baseline findings, falling within the expected ranges of failure reported on state-of-the-art models, including production systems.

The interplay between biased training data and model architecture limitations significantly affects adversarial robustness in offensive language detection. The dataset, drawn exclusively from Twitter and YouTube, introduces bias by offering a limited representation of Moroccan online users. This bias can undermine the effectiveness of adversarial attacks, as the generated adversarial examples may fail to reflect the linguistic diversity and context found across different platforms. Consequently, this limitation could restrict the success of attacks in effectively targeting the offensive language model. The lack of diversity in the original data means that the adversarial testing might not reveal all the vulnerabilities of the model. Adversarial data might be biased towards the specific characteristics of the original dataset, leading to an incomplete assessment of the model’s robustness. If adversarial testing is primarily conducted with examples similar to those in the original dataset, the model might overfit to these specific adversarial patterns. This can lead to a situation where the model is optimized to handle only a narrow set of adversarial examples, rather than being robust to a wide range of adversarial attacks. Moreover, the prevalence of short sentences in our data limits the contextual depth of the data. Adversarial examples created from such limited contexts may struggle to fully exploit the model’s weaknesses, as the lack of nuances could reduce the chances of successful manipulations. The low success rate of attacks that rely on word-level transformations highlights the fact that transformer tokenizers, like Byte Pair Encoding (BPE) or WordPiece, decompose words into subword units. This decomposition reduces the vulnerability of models to adversarial attacks that attempt to manipulate specific words, whether by substituting them with synonyms or by introducing misspellings. 
As a result, the effectiveness of such attacks is diminished, highlighting the need for more sophisticated methods to generate quality adversarial data.

5. Conclusions and Future Work

In this work, we present a contribution towards building an offensive language detection system in Moroccan Darija. We introduce a human-labeled dataset comprising Darija sentences collected from social media platforms. This study addresses the scarcity of labeled datasets for Darija and provides a resource for natural language processing researchers who are interested in Darija, particularly in offensive language detection and sentiment analysis tasks.

We also fine-tuned several BERT-based models on the offensive language detection task. Interestingly, the smaller Darija-specific language models achieved the best results. Using compact models will have an impact on online metrics when the ML model is deployed, such as on the latency, requests per second, and resource utilization.

We tested the best model using adversarial data to evaluate correctness, robustness, and fairness. The results demonstrate that the attacks were able to deceive the model and reveal its vulnerabilities. Although adversarial attacks designed to test correctness had relatively low success rates, the model showed vulnerabilities in robustness and fairness that were not revealed with standard evaluations. The testing pinpointed the specific characteristics of data that led to incorrect predictions and discerned the areas of vulnerabilities at the levels of words or characters.

Low-resource languages face several fundamental challenges, such as the scarcity of large, labeled datasets; the variability across dialects; and the lack of foundational linguistic tools like Part-of-Speech taggers and Named Entity Recognition systems. In the case of Moroccan Darija, these challenges are compounded by the diglossic nature of the language, where the standard Arabic used in official contexts differs significantly from the spoken dialects. As a result, models trained on standard Arabic often perform poorly when applied to regional dialects. In recent years, research on low-resource languages has gained momentum, leveraging techniques like transfer learning and multilingual models (e.g., mBERT and XLM-R), which can transfer knowledge from high-resource languages. However, such approaches often suffer from generalization issues due to insufficient fine-tuning on the specific dialects or sub-languages. This gap is particularly evident in NLP tasks, such as offensive language detection, where model performance is crucial for social media moderation and public safety but remains under-developed for languages like MD. The contribution of this study lays the groundwork for addressing the broader need for dialect-specific solutions that account for linguistic and cultural variations in low-resource languages. Moreover, the techniques used here are transferable to other Arabic dialects and could be beneficial for multidialectal Arabic efforts, where models that generalize across dialects are needed. This can help bridge the gap in NLP for Arabic as a whole, supporting efforts in dialect-specific research while contributing to a shared framework.

The limitations of this work stem from the inherent complexity of NLP systems and the testing methodologies used. While metamorphic testing and adversarial attacks were effective in uncovering certain vulnerabilities related to robustness and fairness, other bugs may still remain undetected. This is due to the challenges of crafting comprehensive adversarial examples that can explore the full input space, particularly for languages with rich morphology and syntax, like Arabic. Moreover, the tested adversarial cases may not cover all real-world variations or contexts in offensive language, leaving some edge cases and system weaknesses unexamined. Additionally, relying solely on adversarial testing may not expose all latent issues in the data and models, especially biases that emerge in unexpected usage scenarios. This limitation stems from the lack of variability in our dataset, which does not necessarily cover the full spectrum of regional or linguistic nuances, potentially leaving the model vulnerable to performance gaps in real-world applications.

In light of the limited body of NLP research on Moroccan Darija, it is also crucial to outline strategic directions for advancing this field. First, there is a need for the development of comprehensive, annotated datasets specific to Darija to facilitate more robust training and evaluation of NLP models. Second, research into models that can effectively handle variations across different Arabic dialects is essential for advancing NLP applications in Arabic-speaking countries. Finally, there is a need to focus on methods for effective code-switching and adaptations, as switches between languages reflect authentic language use in many social contexts.

A future direction of this work is to improve the offensive language detection model’s resilience against adversarial attacks by exploring different types of defense methods, such as using adversarial training, which entails training the model on a combination of original and adversarial data. We will also experiment on newer LLMs and investigate multilingual offensive language detection.

Author Contributions

Conceptualization, A.M.; methodology, A.M., A.I., I.A., M.A. and M.A.E.B.; validation, A.I., I.A., M.A.E.B. and S.D.; formal analysis, A.I., I.A., M.A., M.A.E.B. and S.D.; investigation, A.I., I.A., M.A., M.A.E.B. and S.D.; resources, A.M.; data curation, A.I., M.A.E.B. and M.A.; writing—original draft preparation, A.I., I.A., M.A.E.B. and A.M.; writing—review and editing, A.I. and A.M.; visualization, A.I., I.A. and M.A.E.B.; supervision, A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The original data presented in the study are openly available at https://data.mendeley.com/datasets/2y4m97b7dc/3 (accessed on 29 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Declaration of Generative AI and AI-Assisted Technologies in the Writing Process

During the preparation of this work, the authors used the paraphrasing tool Quillbot and ChatGPT in order to improve readability. After using these tools, the authors reviewed and edited the content as needed and take responsibility for the publication’s content.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Machine learning workflow with metamorphic testing using adversarial data.

Figure 2. Distribution of sentence lengths in the dataset.

Figure 3. Algorithm for finding important words in a sentence.

Table 1

Dataset specifications.

Subject	Data Science
Specific subject area	Offensive language detection in the Moroccan dialect
Type of data	Tabular
How the data were acquired	The data were collected using Twitter’s API and web scraping of YouTube comments.
Data format	Raw, filtered
Description of data collection	Two tools were utilized to extract data. The open API for Twitter was used to extract Moroccan tweets, whereas the open source web scraping tool Selenium was employed to scrape YouTube comments. The data labeling process was conducted manually by human annotators. The labeled sentences were then saved in a tabular form as a CSV file.
Data source location	The Web
Data accessibility	Public repository. Repository name: Mendeley Data. Data identification number: 10.17632/2y4m97b7dc.3. Direct URL to data: https://data.mendeley.com/datasets/2y4m97b7dc/3 (accessed on 9 July 2024)

Table 2

Sample dataset sentences with English translations and labels.

Original Text	English Translation	Label
bghit nmchi l italie	I want to go to Italy	Inoffensive
المغرب اوعاشا عزازعليهم حنا علينا زادو واخا	Despite the price increases they inflicted on us they like us and long live Morocco	Inoffensive
الانتخابات مع عنديش ما خي و قلبي من انصرك الله	God bless you even if I don’t like the elections	Inoffensive
المعفون عليه تفو اللسان كبرلو الشعر كبر	As his hair grew, so did his tongue, so disgusting	Offensive
البشر لحوم واكل للمفسدين القيمه نعطي اصبحنا لمن انظر	Look at whom we have come to value: the corrupt and cannibals	Offensive
دارو ف يغرسو النعناع يشرب بغا اللي	Whoever wants to drink mint tea should plant it in their own home	Inoffensive
daba achno l faida men had les vaccins ila kolchi mrad o le nombre des cas tla3 o bnadem kaymout lah iyakhod lhak	Now what is the purpose of the vaccine if everyone is sick and the number of cases is rising and people are dying, may God bring justice	Inoffensive
sir 9aleb 3la chi khedma	Go find yourself a job	Inoffensive
twe7echt sou7abi bzaf	I miss my friends so much	Inoffensive
فقط تنظيمه و تنقيته يجب لكن جميل التقليدي السوق	The traditional market is beautiful but it needs to be cleaned and organized	Inoffensive
هدشي من كتر العداب خاصو الشعب هد	This people deserve more punishment	Offensive
جهنم الي والتنميه العداله حزب	The Justice and Development Party to Hell	Offensive
مكلخ شعب وحنا	We are a foolish people	Offensive
zink 1 comprimé kol sbah	One Zink tablet every morning	Inoffensive

Table 3

Metamorphic rules for testing the model’s correctness.

Metamorphic Rule	Description
MR1: Inserting filler words	Filler words are words that we naturally add to our sentences when speaking and that do not alter the semantic meaning of the sentence such as “uh” or “I mean…”. In this transformation, we insert filler words at the middle, end, or beginning of a sentence. The filler words added at the middle are different from those added at the end or beginning because some words may disrupt the readability of the sentence when inserted in certain positions. The added filler words do not change the meaning of the sentence.
MR2: Substituting words with their semantic neighbors	We replace a word in a sentence by a related one in terms of meaning and context, e.g., replacing ‘my hands hurt’ by ‘my eyes hurt’. We use Darija word embeddings to substitute the most significant words with others that are similar in terms of semantics, meaning, or context [62]. After loading the word embeddings, we use the Gensim library along with its KeyedVectors class to find a word with the same meaning as the input word using the cosine similarity. The function produces a list of comparable words and their respective similarity scores. We select a word that has a similarity score of 0.7 or higher.
MR3: Modifying the named entities	We replace certain named entities with others from the same category. For example, we substitute one city name with another. To achieve this, we employ a combination of two different NER tools as each excels with specific types of entities. Specifically, we use CAMeLBERT MSA to identify persons and locations [29] and Marefa NER to identify nationalities, occupations, and temporal expressions [47]. While not originally designed for MD, we have chosen to use these tools due to the limited availability of suitable resources for MD at the time of performing the present research.
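
The neighbor selection in MR2 can be sketched as follows. The rule itself relies on Darija word embeddings loaded with Gensim's KeyedVectors class; the sketch below instead uses a hand-written cosine similarity over a tiny dictionary so that it runs without external files. The three-dimensional vectors, the English toy vocabulary, and the function names are hypothetical; only the 0.7 threshold comes from the rule description.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]

# Hypothetical toy embeddings; real Darija embeddings have far more
# dimensions and a far larger vocabulary.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "hands": [0.9, 0.1, 0.0],
    "eyes": [0.8, 0.2, 0.1],
    "market": [0.0, 0.1, 0.9],
}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_neighbor(
    word: str, embeddings: Dict[str, Vector], threshold: float = 0.7
) -> Optional[str]:
    """Return the most similar other word scoring at least `threshold`."""
    if word not in embeddings:
        return None
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None


def apply_mr2(
    sentence: str, important_word: str, embeddings: Dict[str, Vector]
) -> str:
    """Replace `important_word` with its semantic neighbor, if one exists."""
    neighbor = semantic_neighbor(important_word, embeddings)
    if neighbor is None:
        return sentence
    return " ".join(
        neighbor if tok == important_word else tok for tok in sentence.split()
    )


print(apply_mr2("my hands hurt", "hands", TOY_EMBEDDINGS))  # my eyes hurt
```

When no candidate reaches the threshold, the sentence is returned unchanged, so the transformation never substitutes an unrelated word.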

Table 4

Metamorphic rules for testing the model’s robustness.

Metamorphic Rule	Description
MR4: MRs 4.1 to 4.6 investigate how much changing characters in words affects the model performance.
MR4.1: Swapping letters	We permute two characters in the middle of important words.
MR4.2: Inserting keyboard-adjacent typo errors	We simulate a keyboard typo by adding a single character in the middle of a word. The inserted character is keyboard-adjacent to the letter next to which it is placed.
MR4.3: Replacing letters with keyboard-adjacent typo errors	We replace one letter with its neighbor on the keyboard. This transformation is more disruptive than an insertion because not all characters of the original word remain.
MR4.4: Replacing letters with less subtle typo errors	We modify the middle letter in the most important words by replacing it with random characters to create noise.
MR4.5: Repeating vowels	We repeat a vowel in the middle or end of a word to convey emotions.
MR4.6: Deleting letters	We delete one letter from the middle of important words.
MR5: MRs 5.1 to 5.4 test the impact of changing words on the performance of the model.
MR5.1: Replacing words by ‘Unknown’	We replace only one important word with the token ‘unknown’ so as not to alter the meaning of the sentence significantly.
MR5.2: Inserting dots between letters	We insert a dot between each pair of letters in important words.
MR5.3: Inserting white spaces between letters	We insert a space between subsequent letters of important words.
MR5.4: Deleting white spaces between words	We delete the space between two adjacent words, ensuring that at least one is important.
MR6: MRs 6.1 to 6.3 check whether changing the punctuation in the sentences matters for the robustness of the model.
MR6.1: Modifying punctuation	We delete parentheses if they are present, or add them if they are not. We also replace all periods and commas with exclamation points (‘!!’) because they serve as accentuators.
MR6.2: Inserting punctuation	We add exclamation points (!!!) if no punctuation is present, avoiding their placement before important words.
MR6.3: Deleting punctuation	We delete all punctuation if there is any.
MR7: MRs 7.1 to 7.3 check whether perturbing the numbers in a sentence affects the performance of the model.
MR7.1: Inserting numbers	We insert random numbers in the middle and the end of a sentence. The numbers added at the end are years to simulate a time indication.
MR7.2: Modifying numbers	We replace all the numbers in a sentence with zeros (‘0’).
MR7.3: Deleting numbers	We delete all the numbers from a sentence.
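
Several of the character-level and word-level rules in Table 4 reduce to short string transformations. The sketch below illustrates MR4.1, MR4.5, MR4.6, MR5.2, and MR5.3 under simplifying assumptions: the set of important words is supplied by the caller rather than computed, swaps are restricted to adjacent inner letters, and vowel repetition handles only Latin-script (Arabizi) vowels. All function names are hypothetical.

```python
import random
from typing import Callable, Set

LATIN_VOWELS = set("aeiou")  # assumption: Latin-script (Arabizi) text only


def swap_middle_letters(word: str, rng: random.Random) -> str:
    """MR4.1: swap two adjacent inner characters, keeping first and last."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def repeat_vowel(word: str, extra: int = 3) -> str:
    """MR4.5: repeat the first vowel found after the initial character."""
    for i in range(1, len(word)):
        if word[i].lower() in LATIN_VOWELS:
            return word[:i] + word[i] * (extra + 1) + word[i + 1:]
    return word


def delete_middle_letter(word: str, rng: random.Random) -> str:
    """MR4.6: delete one inner character, keeping first and last."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1:]


def insert_dots(word: str) -> str:
    """MR5.2: insert a dot between each pair of letters."""
    return ".".join(word)


def insert_spaces(word: str) -> str:
    """MR5.3: insert a space between subsequent letters."""
    return " ".join(word)


def perturb_sentence(
    sentence: str, important: Set[str], transform: Callable[[str], str]
) -> str:
    """Apply `transform` only to the words listed in `important`."""
    return " ".join(
        transform(tok) if tok in important else tok for tok in sentence.split()
    )


print(perturb_sentence("sir 9aleb 3la chi khedma", {"khedma"}, insert_dots))
```

The random transformations take an explicit `random.Random` instance so that a test run can be reproduced from a seed, for example `perturb_sentence(s, words, lambda w: swap_middle_letters(w, random.Random(0)))`.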

Table 5

Metamorphic rules for testing the model’s fairness.

Metamorphic Rule	Description
MR8: MRs 8.1 to 8.3 test the fairness of the model by changing the named entities with sensitive names.
MR8.1: Modifying the country	We replace the countries with those currently at war or that have political conflicts with the substituted one.
MR8.2: Modifying the nationality	We replace the nationalities with those that are related to wars or political conflicts.
MR8.3: Modifying gender	We replace a male name or pronoun with a female name or pronoun.
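
The fairness rules share one metamorphic relation: swapping a named entity for a sensitive counterpart should leave the predicted label unchanged. The sketch below checks that relation with a plain lookup table standing in for the NER-based entity detection, which is a deliberate simplification. The substitution pairs mirror the translated examples in Table 11; the stub classifiers and function names are hypothetical and exist only to exercise the check.

```python
from typing import Callable, Dict

# Hypothetical substitution table; a real pipeline would locate entities
# with an NER model instead of exact token matching.
SENSITIVE_SWAPS: Dict[str, str] = {"Samir": "Lamiae", "Moroccan": "Algerian"}


def swap_entities(sentence: str, swaps: Dict[str, str]) -> str:
    """Replace every token that has an entry in `swaps`."""
    return " ".join(swaps.get(tok, tok) for tok in sentence.split())


def violates_fairness(
    sentence: str, classify: Callable[[str], str], swaps: Dict[str, str]
) -> bool:
    """True if the predicted label changes after the entity swap."""
    return classify(sentence) != classify(swap_entities(sentence, swaps))


def fair_stub(sentence: str) -> str:
    """Stub classifier that ignores named entities entirely."""
    return "Not offensive"


def biased_stub(sentence: str) -> str:
    """Stub classifier that reacts to a name, simulating a biased model."""
    return "Offensive" if "Lamiae" in sentence.split() else "Not offensive"


text = "my friend Samir lives in France"
print(violates_fairness(text, fair_stub, SENSITIVE_SWAPS))    # False
print(violates_fairness(text, biased_stub, SENSITIVE_SWAPS))  # True
```

Each sentence for which the check returns True is a fairness failure of the kind listed in Table 11.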

Table 6

Offensive language detection results. Bold values indicate the best results.

Model		Training				Validation
		Acc	Prec	Rec	F1	Acc	Prec	Rec	F1
DarELECTRA	CNN	0.95	0.93	0.95	0.93	0.89	0.85	0.84	0.84
	LSTM	0.95	0.92	0.96	0.94	0.89	0.85	0.84	0.84
	CNN-LSTM	0.91	0.90	0.87	0.87	0.89	0.89	0.81	0.84
DarRoBERTa	CNN	0.95	0.91	0.90	0.90	0.89	0.88	0.84	0.84
	LSTM	0.97	0.94	0.97	0.95	0.90	0.87	0.84	0.85
	CNN-LSTM	0.95	0.95	0.93	0.93	0.90	0.88	0.84	0.85
DarijaBERT	CNN	0.97	0.95	0.98	0.96	0.89	0.83	0.87	0.84
	LSTM	0.95	0.92	0.96	0.93	0.89	0.83	0.88	0.84
	CNN-LSTM	0.93	0.91	0.91	0.90	0.89	0.84	0.87	0.84
ARBERT	CNN	0.97	0.95	0.97	0.96	0.88	0.83	0.86	0.83
	LSTM	0.97	0.95	0.98	0.96	0.88	0.83	0.86	0.84
	CNN-LSTM	0.96	0.93	0.96	0.94	0.88	0.81	0.89	0.84
MARBERT	CNN	0.96	0.94	0.96	0.95	0.89	0.87	0.85	0.85
	LSTM	0.95	0.92	0.94	0.93	0.88	0.84	0.85	0.84
	CNN-LSTM	0.97	0.95	0.98	0.96	0.89	0.85	0.85	0.84

Table 7

Overview of top performance results from [65] across monolingual and multilingual settings in English, Spanish, and German.

Setting	Language	Dataset	LLM	Prec	Rec	F1
Monolingual	English	OLID + SOLID	Mistral	89.1	94.2	90.9
	Spanish	OffendES	Mistral	82.2	87.7	84.4
	German	GermEval 2018	GPT-3.5	85.8	83.6	84.5
Multilingual	English	Combined Datasets	Mistral	89.4	94.4	91.2
	Spanish		Mistral	84.0	86.7	85.2
	German		GPT-3.5	84.1	83.9	84.0

Table 8

Correctness testing results.

Dataset	Accuracy	Precision	Recall	F1-Score	Success Rate
Original dataset	0.85	0.85	0.90	0.87	-
Insert filler words	0.83	0.82	0.89	0.85	7.9%
Replace a word by a similar word	0.84	0.85	0.87	0.85	1.1%
Replace named entities	0.84	0.82	0.90	0.85	4%

Table 9

Robustness testing results.

Dataset	Accuracy	Precision	Recall	F1-Score	Success Rate
Original dataset	0.85	0.85	0.90	0.87	-
Modify one character	0.76	0.82	0.80	0.81	17.9%
Modify character by random noise	0.76	0.78	0.83	0.80	18.3%
Modify punctuation	0.84	0.83	0.90	0.86	5.8%
Modify numbers	0.84	0.85	0.90	0.87	0.84%
Insert dots between letters	0.65	0.57	0.82	0.67	29.4%
Insert space between letters	0.72	0.83	0.75	0.79	24.5%
Repeat vowels	0.77	0.72	0.90	0.80	13.4%
Insert numbers	0.86	0.89	0.88	0.88	6.1%
Delete numbers	0.84	0.86	0.89	0.87	1.16%
Delete punctuation	0.84	0.85	0.90	0.87	0.42%
Delete space between two words	0.79	0.84	0.82	0.83	16.1%
Replace a word by ‘unknown’ token	0.82	0.88	0.84	0.86	10.6%

Table 10

Fairness testing results.

Dataset	Accuracy	Precision	Recall	F1-Score	Success Rate
Original dataset	0.85	0.85	0.90	0.87	-
Replace named entities by sensitive names	0.83	0.82	0.86	0.86	7%
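
The success rate column in Tables 8 to 10 can be read with the help of a small computation. The sketch below assumes that an attack succeeds when the label predicted for the perturbed input differs from the label predicted for the original input, and that the rate is the percentage of such inputs. This definition is an assumption made for illustration and may differ in detail from the one used to produce the tables; the function name is hypothetical.

```python
from typing import Sequence


def attack_success_rate(
    original_preds: Sequence[int], perturbed_preds: Sequence[int]
) -> float:
    """Percentage of inputs whose predicted label flips after perturbation."""
    if len(original_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    if not original_preds:
        return 0.0
    flipped = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return 100.0 * flipped / len(original_preds)


# Two of the four predictions change, so the rate is 50%.
print(attack_success_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # 50.0
```

Under this reading, a lower success rate indicates a model that is more stable under the corresponding transformation.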

Table 11

Examples of misclassifications with adversarial data.

Adversarial Attack	Adversarial Input	Model Output
Repeat vowels	Original: خلقه علي الله كرم الذي الكاءن انت Translation: You are the being whom God has honored above all His creation.	Not offensive
Repeat vowels	Adversarial: خلقه علي الله كرم الذييييي الكاااااءن انت Translation: You are the beeeeing whom God has hoooonored above aall His creation.	Offensive
Inserting dots	Original: ليهم يغفر الله التعليم رجال Translation: May God forgive the educators	Not offensive
Inserting dots	Adversarial: ل.ي.ه.م يغفر الله التعليم ر.ج.ا.ل. Translation: May God f.o.r.g.i.v.e the e.d.u.c.a.t.o.r.s	Offensive
Character deletion	Original: 9rbo tani linti5abat welad l7ram Translation: The elections are approaching the time of bastards	Offensive
Character deletion	Adversarial: rbo tani lintiabat welad lram Translation: The elections are pproaching the time of bstards	Not offensive
Named entity change	Original: فتامارا ديما سمير وولدي انا Translation: Me and my son Samir we are always in hard labor	Not offensive
Named entity change	Adversarial: فتامارا ديما لمياء وبنتي انا Translation: Me and my daughter Lamiae we are always in hard labor	Offensive
Named entity change	Original: ففرنسا عايش مغربي انا Translation: I am a Moroccan living in France	Not offensive
Named entity change	Adversarial: ففرنسا عايش جزائري انا Translation: I am an Algerian living in France	Offensive

References

1. Muneer, A.; Fati, S.M. A Comparative Analysis of Machine Learning Techniques for Cyberbullying Detection on Twitter. Future Internet; 2020; 12, 187. [DOI: https://dx.doi.org/10.3390/fi12110187]

2. Aghzal, M.; Mourhir, A. Distributional Word Representations for Code-mixed Text in Moroccan Darija. Procedia Comput. Sci.; 2021; 189, pp. 266-273. [DOI: https://dx.doi.org/10.1016/j.procs.2021.05.090]

3. Sedrati, A.; Ait Ali, A. Moroccan Darija in Online Creation Communities: Example of Wikipedia. Al-Andal. Maghreb; 2019; 26, pp. 1-14. [DOI: https://dx.doi.org/10.25267/AAM.2019.i26.11]

4. Morocco, P.C. Moroccan Darija Textbook. 2011; Available online: https://friendsofmorocco.org/Docs/Darija/Moroccan%20Arabic%20textbook%202011.pdf (accessed on 7 September 2024).

5. El-Hairan, Z. Darija, the Evolution of Oral Arabic in Morocco. 2011; Available online: https://www.academia.edu/8123140/Darija_the_evolution_of_Oral_Arabic_in_Morocco (accessed on 9 November 2024).

6. Braiek, H.B.; Khomh, F. On Testing Machine Learning Programs. J. Syst. Softw.; 2020; 164, 110542. [DOI: https://dx.doi.org/10.1016/j.jss.2020.110542]

7. Ackerman, S.; Farchi, E.; Raz, O.; Zalmanovici, M.; Dube, P. Detection of Data Drift and Outliers Affecting Machine Learning Model Performance Over Time. arXiv; 2020; arXiv: 2012.09258

8. Brożek, B.; Furman, M.; Jakubiec, M.; Kucharzyk, B. The Black Box Problem Revisited. Real and Imaginary Challenges for Automated Legal Decision Making. Artif. Intell. Law; 2024; 32, pp. 427-440. [DOI: https://dx.doi.org/10.1007/s10506-023-09356-9]

9. Asudeh, A.; Shahbazi, N.; Jin, Z.; Jagadish, H. Identifying Insufficient Data Coverage for Ordinal Continuous-Valued Attributes. Proceedings of the 2021 International Conference on Management of Data; Virtual, 20–25 June 2021; pp. 129-141.

10. Aggarwal, A.; Shaikh, S.; Hans, S.; Haldar, S.; Ananthanarayanan, R.; Saha, D. Testing Framework for Black-box AI Models. Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion); Madrid, Spain, 25–28 May 2021; pp. 81-84.

11. Liang, B.; Li, H.; Su, M.; Bian, P.; Shi, X.L.; Wang, W. Deep Text Classification Can Be Fooled. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18); Stockholm, Sweden, 13–19 July 2018; pp. 4208-4215.

12. Prabhakaran, V.; Hutchinson, B.; Mitchell, M. Perturbation Sensitivity Analysis to Detect Unintended Model Biases. arXiv; 2019; arXiv: 1910.04210

13. Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; AbdelMajeed, M.; Zia, T. Abusive Language Detection from Social Media Comments Using Conventional Machine Learning and Deep Learning Approaches. Multimed. Syst.; 2022; 28, pp. 1925-1940. [DOI: https://dx.doi.org/10.1007/s00530-021-00784-8]

14. Hajibabaee, P.; Malekzadeh, M.; Ahmadi, M.; Heidari, M.; Esmaeilzadeh, A.; Abdolazimi, R. Offensive Language Detection on Social Media Based on Text Classification. Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC); Las Vegas, NV, USA, 26–29 January 2022; pp. 0092-0098.

15. Jahan, M.S.; Oussalah, M. A Systematic Review of Hate Speech Automatic Detection Using Natural Language Processing. Neurocomputing; 2023; 546, 126232. [DOI: https://dx.doi.org/10.1016/j.neucom.2023.126232]

16. Ashok Kumar, J.; Abirami, S.; Trueman, T.E.; Cambria, E. Comment Toxicity Detection via a Multichannel Convolutional Bidirectional Gated Recurrent Unit. Neurocomputing; 2021; 441, pp. 272-278.

17. Al-Hassan, A.; Al-Dossari, H. Detection of Hate Speech in Arabic Tweets Using Deep Learning. Multimed. Syst.; 2022; 28, pp. 1963-1974. [DOI: https://dx.doi.org/10.1007/s00530-020-00742-w]

18. Alatawi, H.S.; Alhothali, A.M.; Moria, K.M. Detecting White Supremacist Hate Speech Using Domain Specific Word Embedding with Deep Learning and BERT. IEEE Access; 2021; 9, pp. 106363-106374. [DOI: https://dx.doi.org/10.1109/ACCESS.2021.3100435]

19. Nikolov, A.; Radivchev, V. Nikolov-Radivchev at SemEval-2019 Task 6: Offensive Tweet Classification with BERT and Ensembles. Proceedings of the 13th International Workshop on Semantic Evaluation; Minneapolis, MN, USA, 6–7 June 2019; pp. 691-695.

20. Ranasinghe, T.; Zampieri, M.; Hettiarachchi, H. BRUMS at HASOC 2019: Deep Learning Models for Multilingual Hate Speech and Offensive Language Identification. Proceedings of the FIRE (Working Notes); Hyderabad, India, 16–20 December 2020 2019; pp. 199-207.

21. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv; 2018; arXiv: 1810.04805

22. Mohaouchane, H.; Mourhir, A.; Nikolov, N.S. Detecting Offensive Language on Arabic Social Media Using Deep Learning. Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS); Granada, Spain, 22–25 October 2019; pp. 466-471.

23. Abdelsamie, M.M.; Azab, S.S.; Hefny, H.A. A Comprehensive Review on Arabic Offensive Language and Hate Speech Detection on Social Media: Methods, Challenges and Solutions. Soc. Netw. Anal. Min.; 2024; 14, 111. [DOI: https://dx.doi.org/10.1007/s13278-024-01258-1]

24. Althobaiti, M.J. BERT-based Approach to Arabic Hate Speech and Offensive Language Detection in Twitter: Exploiting Emojis and Sentiment Analysis. Int. J. Adv. Comput. Sci. Appl.; 2022; 13, pp. 972-980. [DOI: https://dx.doi.org/10.14569/IJACSA.2022.01305109]

25. El-Alami, F.-z.; Ouatik El Alaoui, S.; En Nahnahi, N. A Multilingual Offensive Language Detection Method Based on Transfer Learning from Transformer Fine-Tuning Model. J. King Saud Univ.-Comput. Inf. Sci.; 2022; 34, pp. 6048-6056. [DOI: https://dx.doi.org/10.1016/j.jksuci.2021.07.013]

26. Koshiry, A.M.E.; Eliwa, E.H.I.; Abd El-Hafeez, T.; Omar, A. Arabic Toxic Tweet Classification: Leveraging the AraBERT Model. Big Data Cogn. Comput.; 2023; 7, 170. [DOI: https://dx.doi.org/10.3390/bdcc7040170]

27. Saeed, R.; Afzal, H.; Rauf, S.A.; Iltaf, N. Detection of Offensive Language and Its Severity for Low Resource Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process.; 2023; 22, 156. [DOI: https://dx.doi.org/10.1145/3580476]

28. Pitenis, Z.; Zampieri, M.; Ranasinghe, T. Offensive Language Identification in Greek. arXiv; 2020; arXiv: 2003.07459

29. Niraula, N.B.; Dulal, S.; Koirala, D. Offensive Language Detection in Nepali Social Media. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021); Kolkata, India, 12–15 December 2021; pp. 67-75.

30. Nandi, A.; Sarkar, K.; Mallick, A.; De, A. A Survey of Hate Speech Detection in Indian Languages. Soc. Netw. Anal. Min.; 2024; 14, 70. [DOI: https://dx.doi.org/10.1007/s13278-024-01223-y]

31. Goyal, S.; Doddapaneni, S.; Khapra, M.M.; Ravindran, B. A Survey of Adversarial Defenses and Robustness in NLP. ACM Comput. Surv.; 2023; 55, pp. 1-39. [DOI: https://dx.doi.org/10.1145/3593042]

32. Chen, T.Y.; Cheung, S.C.; Yiu, S.M. Metamorphic Testing: A New Approach for Generating Next Test Cases. arXiv; 2020; arXiv: 2002.12543

33. de Oliveira, G.A.; de Sousa, R.T.; de Oliveira Albuquerque, R.; García Villalba, L.J. Adversarial Attacks on a Lexical Sentiment Analysis Classifier. Comput. Commun.; 2021; 174, pp. 154-171. [DOI: https://dx.doi.org/10.1016/j.comcom.2021.04.026]

34. Hosseini, H.; Kannan, S.; Zhang, B.; Poovendran, R. Deceiving Google’s Perspective API Built for Detecting Toxic Comments. arXiv; 2017; arXiv: 1702.08138

35. Samanta, S.; Mehta, S. Generating Adversarial Text Samples. Advances in Information Retrieval; Springer: Cham, Switzerland, 2018; pp. 744-749.

36. Belinkov, Y.; Bisk, Y. Synthetic and Natural Noise Both Break Neural Machine Translation. arXiv; 2017; arXiv: 1711.02173

37. Formento, B.; Foo, C.S.; Tuan, L.A.; Ng, S.K. Using Punctuation as an Adversarial Attack on Deep Learning-Based NLP Systems: An Empirical Study. Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023; Dubrovnik, Croatia, 2–6 May 2023; pp. 1-34.

38. Alsmadi, I.; Ahmad, K.; Nazzal, M.; Alam, F.; Al-Fuqaha, A.; Khreishah, A.; Algosaibi, A. Adversarial Attacks and Defenses for Social Network Text Processing Applications: Techniques, Challenges and Future Research Directions. arXiv; 2021; arXiv: 2110.13980

39. Alzantot, M.; Sharma, Y.; Elgohary, A.; Ho, B.-J.; Srivastava, M.; Chang, K.-W. Generating Natural Language Adversarial Examples. arXiv; 2018; arXiv: 1804.07998

40. Jia, R.; Raghunathan, A.; Göksel, K.; Liang, P. Certified Robustness to Adversarial Word Substitutions. arXiv; 2019; arXiv: 1909.00986

41. Ribeiro, M.T.; Singh, S.; Guestrin, C. Semantically Equivalent Adversarial Rules for Debugging NLP Models. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Melbourne, Australia, 15–20 July 2018; pp. 856-865.

42. Eger, S.; Şahin, G.G.; Rücklé, A.; Lee, J.-U.; Schulz, C.; Mesgar, M.; Swarnkar, K.; Simpson, E.; Gurevych, I. Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems. arXiv; 2019; arXiv: 1903.11508

43. Fu, X.; Gu, Z.; Han, W.; Qian, Y.; Wang, B. Exploring Security Vulnerabilities of Deep Learning Models by Adversarial Attacks. Wirel. Commun. Mob. Comput.; 2021; 2021, 9969867. [DOI: https://dx.doi.org/10.1155/2021/9969867]

44. Zang, Y.; Qi, F.; Yang, C.; Liu, Z.; Zhang, M.; Liu, Q.; Sun, M. Word-level Textual Adversarial Attacking as Combinatorial Optimization. arXiv; 2019; arXiv: 1910.12196

45. Tu, K.; Jiang, M.; Ding, Z. A Metamorphic Testing Approach for Assessing Question Answering Systems. Mathematics; 2021; 9, 726. [DOI: https://dx.doi.org/10.3390/math9070726]

46. Iyyer, M.; Wieting, J.; Gimpel, K.; Zettlemoyer, L. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. arXiv; 2018; arXiv: 1804.06059

47. Ren, S.; Deng, Y.; He, K.; Che, W. Generating Natural Language Adversarial Examples Through Probability Weighted Word Saliency. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Florence, Italy, 28 July–2 August 2019; pp. 1085-1097.

48. Jiang, M.; Bao, H.; Tu, K.; Zhang, X.Y.; Ding, Z. Evaluating Natural Language Inference Models: A Metamorphic Testing Approach. Proceedings of the 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE); Wuhan, China, 25–28 October 2021; pp. 220-230.

49. Ma, P.; Wang, S.; Liu, J. Metamorphic Testing and Certified Mitigation of Fairness Violations in NLP Models. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20); Yokohama, Japan, 7–15 January 2020; pp. 458-465.

50. Xu, Y.; Zhou, Z.Q.; Zhang, X.; Wang, J.; Jiang, M. Metamorphic Testing of Named Entity Recognition Systems: A Case Study. IET Softw.; 2022; 16, pp. 386-404. [DOI: https://dx.doi.org/10.1049/sfw2.12058]

51. Sun, Y.; Ding, Z.; Huang, H.; Zou, S.; Jiang, M. Metamorphic Testing of Relation Extraction Models. Algorithms; 2023; 16, 102. [DOI: https://dx.doi.org/10.3390/a16020102]

52. Yan, B.; Yecies, B.; Zhou, Z.Q. Metamorphic Relations for Data Validation: A Case Study of Translated Text Messages. Proceedings of the 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET); Montreal, QC, Canada, 26 May 2019; pp. 70-75.

53. Jiang, M.; Chen, T.Y.; Wang, S. On the Effectiveness of Testing Sentiment Analysis Systems with Metamorphic Testing. Inf. Softw. Technol.; 2022; 150, 106966. [DOI: https://dx.doi.org/10.1016/j.infsof.2022.106966]

54. Wang, B.; Xu, C.; Liu, X.; Cheng, Y.; Li, B. SemAttack: Natural Textual Attacks via Different Semantic Spaces. arXiv; 2022; arXiv: 2205.01287

55. Tsai, Y.-T.; Yang, M.-C.; Chen, H.-Y. Adversarial Attack on Sentiment Classification. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Florence, Italy, 1 August 2019; pp. 233-240.

56. Rusert, J.; Shafiq, Z.; Srinivasan, P. On the Robustness of Offensive Language Classifiers. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2203.11331]

57. Ribeiro, M.T.; Wu, T.; Guestrin, C.; Singh, S. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Online, 5–10 July 2020; 442. [DOI: https://dx.doi.org/10.18653/v1/2020.acl-main.442]

58. Abdelaty, M.; Lazem, S. Investigating the Robustness of Arabic Offensive Language Transformer-Based Classifiers to Adversarial Attacks. Proceedings of the 2024 Intelligent Methods, Systems, and Applications (IMSA); Giza, Egypt, 13–14 July 2024; pp. 109-114.

59. Manerba, M.M.; Tonelli, S. Fine-grained Fairness Analysis of Abusive Language Detection Systems with CheckList. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021); Bangkok, Thailand, 6 August 2021; pp. 81-91.

60. Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Online, 1–6 August 2021; pp. 7088-7105.

61. Gaanoun, K.; Naira, A.M.; Allak, A.; Benelallam, I. DarijaBERT: A Step Forward in NLP for the Written Moroccan Dialect. Int. J. Data Sci. Anal.; 2024; [DOI: https://dx.doi.org/10.1007/s41060-023-00498-2]

62. Aghzal, M.; Bouni, M.A.E.; Driouech, S.; Mourhir, A. Compact Transformer-based Language Models for the Moroccan Darija. Proceedings of the 2023 7th IEEE Congress on Information Science and Technology (CiSt); Agadir–Essaouira, Morocco, 16–22 December 2023; pp. 299-304.

63. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW); San Francisco, CA, USA, 24 May 2018; pp. 50-56.

64. Segura, S.; Fraser, G.; Sanchez, A.B.; Ruiz-Cortés, A. A Survey on Metamorphic Testing. IEEE Trans. Softw. Eng.; 2016; 42, pp. 805-824. [DOI: https://dx.doi.org/10.1109/TSE.2016.2532875]

65. He, J.; Wang, L.; Wang, J.; Liu, Z.; Na, H.; Wang, Z.; Chen, Q. Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2410.15623] arXiv: 2410.15623

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Moroccan Darija, a dialect of Arabic, presents unique challenges for natural language processing due to its lack of standardized orthographies, frequent code switching, and status as a low-resource language. In this work, we focus on detecting offensive language in Darija, addressing these complexities. We present three key contributions that advance the field. First, we introduce a human-labeled dataset of Darija text collected from social media platforms. Second, we explore and fine-tune various language models on the created dataset. This investigation identifies a Darija RoBERTa-based model as the most effective approach, with an accuracy of 90% and F1 score of 85%. Third, we evaluate the best model beyond accuracy by assessing properties such as correctness, robustness and fairness using metamorphic testing and adversarial attacks. The results highlight potential vulnerabilities in the model’s robustness, with the model being susceptible to attacks such as inserting dots (29.4% success rate), inserting spaces (24.5%), and modifying characters in words (18.3%). Fairness assessments show that while the model is generally fair, it still exhibits bias in specific cases, with a 7% success rate for attacks targeting entities typically subject to discrimination. The key finding is that relying solely on offline metrics such as the F1 score and accuracy in evaluating machine learning systems is insufficient. For low-resource languages, the recommendation is to focus on identifying and addressing domain-specific biases and enhancing pre-trained monolingual language models with diverse and noisier data to improve their robustness and generalization capabilities in diverse linguistic scenarios.

Details

Title

Investigating Offensive Language Detection in a Low-Resource Setting with a Robustness Perspective

Author

Abdellaoui, Israe; Ibrahimi, Anass; Mohamed Amine El Bouni; Mourhir, Asmaa; Driouech, Saad; Aghzal, Mohamed

First page

170

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

25042289

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/bdcc8120170

ProQuest document ID

3149498810