Word Game Modeling Using Character-Level N-Gram

1. Introduction

In the context of teaching languages, vocabulary is considered one of the main components that have to be mastered by learners, due to its significance and use for communication. According to [1], vocabulary contributes a lot to an understanding of written and spoken texts and it also helps in learning the functions and applicability of new words. Vocabulary mastery affects the way of thinking of students and their creativity in the language learning process, so that the quality of their language learning can also improve. As [2] said, second language learners require an adequate bank of words so as not to impede effective and meaningful communication. The vocabulary of children aged 5–12 can be improved by letter games, especially cubic-based letter games.

Research studies [3,4,5,6] show the importance of word games in the teaching process for school children. They enable children to improve their vocabulary retention skills and also help them to learn words and arrange the letters which make up a word. In the initial stage of a word game, players are able to recognize the letters and words. In the next stages, they learn to form a word by using the correct order of letters, and, in the process, improve their memory building, which expands their vocabulary.

The matching letter game (MLG) aims to construct words by using letter cubes. The MLG develops various skills in transforming letters into words for children [7,8]. It is specially designed for different levels of difficulty, customized for 3–12 year old children, and is well suited to early childhood education and entertainment. It supports children’s learning, word formation ability, and early vocabulary recognition, and it also helps children practice thinking and hands-on skills, such as sorting, grouping, taking turns, and sharing.

The game contains letter cubes, flashcards with predefined words, including a picture defining the word, and a game tray with a slot for cubes while matching letters into a word. The child looks at the picture and word on a card and, then, tries to find each matching letter from the big letter cubes to form a word and puts the letters on the tray in correct order for the word.

Currently available cubic-oriented matching letter games [9] are mainly focused on children aged 3–8 and on the English language. Most of them consist of 8 to 16 letter cubes that can form up to 64 words from the given word flashcards, and the length of the words is 3–4 letters. This means the games include a large number of cubes, yet can form only a limited number of words. The letter placement algorithm for this type of game has not been defined for the Uzbek language.

The Uzbek language (native: O’zbek tili) is a low-resource, highly agglutinative language [10] with null-subject and null-gender characteristics from the Karluk branch of the Turkic language family. One of the most noticeable distinctions between Uzbek and other Turkic languages is the rounding of the vowel, a feature that was influenced by Persian. The Uzbek language has been written in various scripts, such as Arabic, Cyrillic, and Latin. The Latin script (https://en.wikipedia.org/wiki/Uzbek_alphabet, accessed on 1 November 2022) is an official alphabet [11] of the Uzbek language.

Taking into account the aforementioned issues, we created a model for designing a cubic-oriented matching-letter game that minimizes the number of letter cubes, while maintaining coverage of as many words as possible. The model is trained on a selected dataset of the Uzbek language in Latin script, and the output is a set of letters to allocate to the sides of the cubes to form words. To evaluate the performance of the developed models, we created a new dataset for the Uzbek language. For the English language, we used an extracted dataset. For the Russian and Slovenian languages, we generated datasets which include 3–5 letter words, according to the suggestion of experts in the field. A simple stochastic method was also applied to see if it could further improve the overall coverage of the proposed models.

The structure of the rest of the paper is as follows. Section 2 presents an overview of existing methods related to our algorithm. The problem definition and our contributions are stated in Section 3. A detailed description of our proposed model is provided in Section 4. Section 5 outlines the main results obtained by the proposed models, Section 6 discusses them, and Section 7 concludes the paper and highlights our future plans.

2. Scientific Background

In computer science literature, there are some methods of designing word games for different languages, and some research on the characteristics of a target language has been conducted. Since there is no method of design for cubic-word games in the Uzbek language, our current research can be considered a good contribution to this field for the Uzbek language. In this section, we discuss the research performed in the word-game modeling field.

In [12], the authors introduced a solution algorithm for the Uzbek language version of the Wordle game, including the Cyrillic script of this language. This work studied character statistics of the Uzbek five-letter words in Cyrillic script to determine the solving strategy of the word game. The proposed methodology covered computing the letter frequency and positional letter frequency of Uzbek words in Cyrillic script and calculating a word score, based on statistics suggesting the best probability words to obtain a solution in the game with minimal attempts.

The authors in [13] conducted new research to show the effectiveness of the Big Cube game in improving students’ mastery of vocabulary. They designed Big Cube to play a “words and pictures guessing” game by involving 40 grade 8 students as subjects of the research. The authors proved the importance of the cubic game to achieve good success in teaching through experimental evaluations. Pre-test and post-test (both tests comprised 20 questions divided into two sections: multiple choice and sentence completion) results obtained from the 40 students were compared, and statistically proved the significance of cubic games in the study process. This work also shows the importance of our research, because we designed a new model to produce cubic-game cards for children while the study in [13] demonstrated the usefulness of such games in teaching and improving mastery of vocabulary for students, as well as for children.

In [14], the authors demonstrated the appropriateness and efficacy of incorporating word games into English lexical instruction at a middle school in Vietnam. They conducted research on two classes of grade 7 students for eight weeks in order to measure the efficacy of using games in the selected school. The authors showed statistically that the post-test findings after eight weeks of treatment were better in vocabulary retention, which indicates the importance of word games in the teaching process, especially for vocabulary retention and word learning in school children.

The study in [15] showed the effectiveness of using the Build-A-Sentence Cubes Game in teaching simple past tense. They involved 16 students from the Eighth Level of the Q-Learning Course Pontianak as subjects of the research. The results of the post-test were revealed to be noticeably higher than the results obtained in the pre-test. This means that the use of the Build-A-Sentence Cubes Game had a moderate effect in teaching the simple past tense to Eighth Level students of the Q-Learning Course Pontianak.

The main goal of our study is similar to those of the aforementioned works, but we developed novel methods to build a cubic game, while existing studies applied word games in the pedagogical process. The Uzbek language is a low-resource language and there are no publicly available corpora. Therefore, designing algorithms for the Uzbek language brings forth challenges. Since we did not find statistics for letter frequency and two-letter sequence (Bi-gram) frequency calculated on an Uzbek corpus, we computed them on a training dataset in the Uzbek language.

3. Problem Statement

The input dataset, consisting of 3–5 letter words, is given, and our task was to suggest a set of letters to be placed on cubes for the matching letter game. A cube has six sides, and one letter is placed on each side. Before designing the game, we needed to decide how many cubes it would have. We selected cases of five, six, seven, and eight cubes because these numbers of cubes are appropriate to cover 3–5 letter words.

We wanted the word game, made as a result of the proposed method, to enable the formation of as many words as possible from the words in the selected dataset with a minimal number of cubes. Some examples of the letter-matching word game for the Uzbek, English, and Russian languages are illustrated in the following figures.

Figure 1 shows an illustrative example of matching 3–5 letter words by using 6 cubes for the Uzbek language: (a) shows examples of 3 letter words constructed from the cubes, while (b) and (c) illustrate 4 and 5 letter words made from the same cubes, respectively.

The examples of matching 3–5 letter English words by using 5 cubes are presented in Figure 2. The cases in (a)–(c) describe the matching of 3, 4, and 5 letter words constructed by 5 cubes for the English language.

In Figure 3, the same examples of matching 3–5 letter words are illustrated for the Russian language.

While proposing the approach, we set some conditions and restrictions. Since the game is designed for children to get acquainted with letters, the condition was that all letters of the alphabet must be present in the game cubes, at least once. We defined two restrictions to optimize the model. Firstly, a letter is not placed more than once on a single cube to increase the effectiveness, because a player can only use one side of a cube when matching letters for a word. By using different letters, more words can be formed than by using the same letters. Secondly, the number of vowel letters for each cube should be two (or three, depending on the method) to balance the distribution of vowels in cubes.
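The condition and the two restrictions can be checked mechanically for any candidate set of cubes. The following is a minimal sketch, not the authors' implementation; the function name `check_cubes` and the parameter `vowels_per_cube` are hypothetical, and each cube is assumed to be a list of six symbols.

```python
from typing import Iterable, List, Set


def check_cubes(cubes: List[List[str]], alphabet: Iterable[str],
                vowels: Set[str], vowels_per_cube: int = 2) -> List[str]:
    """Return a list of constraint violations (empty if the cubes are valid)."""
    problems = []
    # Condition: every letter of the alphabet appears on at least one cube.
    placed = {letter for cube in cubes for letter in cube}
    missing = sorted(set(alphabet) - placed)
    if missing:
        problems.append(f"missing letters: {missing}")
    for idx, cube in enumerate(cubes):
        if len(cube) != 6:
            problems.append(f"cube {idx} has {len(cube)} sides, expected 6")
        # Restriction 1: no letter is repeated on a single cube.
        if len(set(cube)) != len(cube):
            problems.append(f"cube {idx} repeats a letter")
        # Restriction 2: a fixed number of vowels per cube.
        n_vowels = sum(1 for letter in cube if letter in vowels)
        if n_vowels != vowels_per_cube:
            problems.append(
                f"cube {idx} has {n_vowels} vowels, expected {vowels_per_cube}")
    return problems
```

Passing `vowels_per_cube=3` covers the variant in which three vowels per cube are required.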

As a result of the proposed methods, the obtained set of words from the resulting cubes of the game was computed to measure the accuracy of the model.
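Deciding whether a word can be formed is a small matching problem, because each cube can show only one side at a time, so every letter of the word needs its own cube. The sketch below is one way to compute coverage under that reading; the names `can_form` and `coverage` are hypothetical, and the augmenting-path matching is a standard technique chosen here for illustration rather than a procedure stated in the paper.

```python
from typing import List, Sequence


def can_form(word: Sequence[str], cubes: List[Sequence[str]]) -> bool:
    """True if every letter of `word` can be shown by a different cube."""
    if len(word) > len(cubes):
        return False
    owner = [-1] * len(cubes)  # owner[c] = letter position currently using cube c

    def assign(pos: int, seen: List[bool]) -> bool:
        # Try to give position `pos` a cube, re-assigning earlier letters if needed.
        for c, cube in enumerate(cubes):
            if word[pos] in cube and not seen[c]:
                seen[c] = True
                if owner[c] == -1 or assign(owner[c], seen):
                    owner[c] = pos
                    return True
        return False

    return all(assign(pos, [False] * len(cubes)) for pos in range(len(word)))


def coverage(words: List[Sequence[str]], cubes: List[Sequence[str]]) -> float:
    """Share of `words` (between 0.0 and 1.0) that can be formed from `cubes`."""
    if not words:
        return 0.0
    return sum(can_form(w, cubes) for w in words) / len(words)
```

A greedy left-to-right assignment would wrongly reject some words (for example, when the first letter takes the only cube that a later letter needs), which is why the sketch allows earlier letters to be moved to another cube.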

Our main contributions and goals are as follows:

Since Uzbek is one of the low-resource languages, and there are not enough datasets for it, we created a novel dataset for the Uzbek language to conduct this research;
We developed new models by using character-level N-gram and statistics to design a cubic-oriented matching letter game for Uzbek and other languages;
Experimental evaluations on 4 datasets were performed to show the advantages of our methods.

4. Methodology

In this work, we proposed two models for constructing the cubes, which include the following steps: (1) Data preparation; (2) Generation of letter frequencies, using Uni-gram and Bi-gram methods; (3) Extraction of the sequence of letters which are potential candidates to be placed on the cubes; (4) Creation of a placement algorithm that respects the restrictions on placing letters on the final cubes. The proposed approaches are illustrated in Figure 4.

Each of the aforementioned steps in the above figure is described in the following subsections.

4.1. Data Preparation

To test our models, we created a new dataset for the Uzbek language, because there is no dataset for young children in the Uzbek language. The dataset generation is described in the following steps:

(1). Collection of words. The potential words were extracted from the Madvaliyev and Begmatov book [16], which is the largest dictionary book for the Uzbek language.
(2). Normalization. After the generation of the word list for our dataset, we performed normalization to simplify the coding.
- (a). Uzbek has some digraphs with diacritical signs, which cause problems in calculating letter frequency. We replaced g’ and o’ with ḡ and ō, so that each of these digraphs, previously counted as two characters in a word, became a single character and the letter frequency could be computed correctly. Of the other digraphs, sh was divided into s and h, which are two independent letters. Because c is not a letter of the Uzbek alphabet, ch could not be divided in the same way, so we decided to keep it as a digraph, and it participated as a separate character in the letter frequency calculation.
- (b). The Uzbek alphabet has a character that represents a phonetic glottal stop (native: Tutuq belgisi). Although it is not a real letter, it is still considered a part of the Uzbek alphabet. Only 18 words in the Uzbek dataset contain this character, so we omitted these words in the filtering process.
(3). Filtering. After the normalization, 7948 words, with 3–5 characters, were left in our dataset. Then, these words were filtered by a native Uzbek speaker and an expert in Uzbek linguistics, by removing very rare words and unfamiliar words for school children to create a final dataset (the newly created dataset for the Uzbek language is publicly available https://github.com/UlugbekSalaev/Cubic-Word-Game-Modeling (accessed on 1 November 2022)) with 4558 words.
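The normalization step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `normalize` is hypothetical, the set of apostrophe variants is a guess at what may occur in raw text, ch is kept as one symbol while sh is left as s and h, and any apostrophe that remains after the g’/o’ replacement is treated as the glottal stop, so the word is dropped.

```python
import re
from typing import List, Optional

G_BAR = "\u1e21"  # stands in for the digraph g' (assumed single-character code)
O_BAR = "\u014d"  # stands in for the digraph o'
# Apostrophe-like characters that may appear in raw text (an assumption).
_APOS = "[" + re.escape("'\u2019\u2018\u02bb\u02bc`") + "]"


def normalize(word: str) -> Optional[List[str]]:
    """Return the word as a list of symbols, or None if it should be dropped."""
    w = word.lower().strip()
    w = re.sub("g" + _APOS, G_BAR, w)
    w = re.sub("o" + _APOS, O_BAR, w)
    if re.search(_APOS, w):
        return None  # a remaining apostrophe is read as the glottal stop
    symbols = re.findall(r"ch|.", w)  # keep "ch" together as one symbol
    return symbols if 3 <= len(symbols) <= 5 else None
```

Returning a list of symbols rather than a string keeps ch countable as a single unit in later frequency calculations.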

To check the performance of our model in other languages, an English dataset was obtained from the ESL Forums online page https://eslforums.com (accessed on 10 October 2022), designed for young children to learn the language. The dataset contained a list of common words having 3–5 letters in the English language. Russian and Slovenian datasets [17,18] were generated with the support of experts in the Russian and Slovenian languages, who helped to extract useful and interesting words for school children.

Russian words and their Part of Speech tag list https://www.artint.ru/projects/frqlist/frqlist-en.php (accessed on 10 October 2022) were obtained from the open-source corpus http://opencorpora.org/ (accessed on 10 October 2022). Rare, uncommon, and difficult words for school children were removed from all datasets.

4.2. Statistics

We calculated the statistics for the positional distribution of letters on the selected datasets to find out the regularities between occurrence of vowels and consonants in words.

The Uzbek language has 27 letters and six of them are vowel phonemes: a, e, i, o, u, o‘. English and Russian alphabets consist of 26 and 33 letters, including 5 (a, e, i, o, u) and 10 {а, у, o, ы, и, э, я, ю, ё, е} vowels, respectively. The Slovenian alphabet includes 25 letters, 5 of them being vowels a, e, i, o, u.

It can be seen from Table 1 that most of the vowels have high frequencies in all languages, which means that vowels actively participate in forming words.

The subsets of vowels {a, i, o} for the Uzbek language, {e, a, o, i} for the English language, {a, e, o, i} for the Slovenian language and {а, o, е} for the Russian language are the most frequent vowels, having relative frequencies higher than 5%.

Table 1 shows that vowel letters stand out in high-ranked Uzbek words. Considering the active participation of vowels in the formation of words, we studied the dynamics of positional occurrences of letters at vowel and consonant levels for the selected datasets. Table 2 illustrates the pattern view of word formation with vowel and consonant letters.

A vowel’s frequency is higher in words with short word length (3–5 letters). In order to increase the possibility of forming words, we can increase the number of vowels in a game cube. Table 2 shows that most of the 4 and 5 letter words consist of two vowels.
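The vowel and consonant patterns of a word list can be tallied with a few lines of code. The sketch below is illustrative only; the function name `cv_patterns` is hypothetical, and the vowel set is passed in by the caller.

```python
from collections import Counter
from typing import Iterable, Sequence, Set


def cv_patterns(words: Iterable[Sequence[str]], vowels: Set[str]) -> Counter:
    """Count how many words follow each vowel (V) / consonant (C) pattern."""
    return Counter(
        "".join("V" if symbol in vowels else "C" for symbol in word)
        for word in words
    )
```

Counting the V symbols in each pattern then gives the number of vowels per word for each word length.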

Taking into consideration the letter distribution of all datasets which we wanted to experiment with in the proposed approach, we decided to develop another model by increasing the priority of vowel letters in the word formation stage. Increasing the number of vowel letters in words also improves the overall coverage of the model.

4.3. Methods

According to the letter distribution and the statistics of vowel positional occurrence, we proposed two approaches to design a cubic game. Both approaches are mainly based on character-level Uni-gram and Bi-gram models [19]. Algorithm 1 computes the frequencies of character-level n-grams.

Algorithm 1: The $CL\_ngram\_Frequency$ method for calculating character-level n-gram frequencies

Input: dataset D, number of characters n

Output: Dictionary (list of key-value pairs)

Initialization: Empty dictionary $Frequency$ to store the frequency of each character n-gram

1: for each $word \in D$ do
2:     for $(i = 0; i \leq word.length - n; i++)$ do
3:         if $word[i : i + n) \in Frequency.keys()$ then
4:             Increment $Frequency[word[i : i + n)]$ by 1
5:         else
6:             $Frequency[word[i : i + n)]$ = 1
7:         end if
8:     end for
9: end for
10: return $Frequency$

The $CL\_ngram\_Frequency$ algorithm counts how often each character n-gram occurs in the words of the dataset. If n is equal to 1, the model is called Uni-gram and the method returns a dictionary of single letters with their frequencies (e.g., a:45, b:30). If n is equal to 2, the model is called Bi-gram and the method returns a dictionary of 2-letter chunks with their frequencies (e.g., ab:45, bc:30).
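The counting loop of Algorithm 1 translates almost directly into Python. The following sketch uses `collections.Counter` in place of the explicit dictionary check; the function name `cl_ngram_frequency` mirrors the pseudocode, and words may be plain strings or lists of symbols (an assumption that lets a digraph be treated as one unit).

```python
from collections import Counter
from typing import Iterable, Sequence


def cl_ngram_frequency(dataset: Iterable[Sequence[str]], n: int) -> Counter:
    """Count every run of n consecutive symbols across all words."""
    frequency = Counter()
    for word in dataset:
        # i ranges over 0 .. len(word) - n, as in the pseudocode.
        for i in range(len(word) - n + 1):
            frequency["".join(word[i:i + n])] += 1
    return frequency
```

Calling `cl_ngram_frequency(words, 1).most_common()` yields the letters in descending order of frequency, which is the sorted form the later algorithms expect.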

4.3.1. Letter Frequency ($LF$) Approach

In the first method, we designed a model based on letter frequency, called $LF$. Bi-gram frequency plays an important role in the perception of words and non-words [20]. More precisely, we used the Bi-gram method to order the alphabet letters by frequency, and the remaining sides of the cubes were filled with duplicate letters chosen by the Uni-gram method. The procedure of the first model is given in Algorithm 2.

Algorithm 2: Algorithm for designing letters for the cubic game based on the $LF$ approach

Input: A set of alphabet letters L, dataset D, number of cubes N

Output: Set of letters to form cubes corresponding to N

Initialization: Assign list of alphabet to A, number of cubes to N, empty list $DL$ to store the duplicate letters, empty list $CL$ to store the letter sequence, empty 2D array $Cubes$ of size N×6

1: $Unigram\_Frequency$ = $CL\_ngram\_Frequency$(D, 1); ▹ Returns a dictionary which contains a letter (key) and its frequency (value), e.g., ['a':90, 'b':75, ...]
2: $Bigram\_Frequency$ = $CL\_ngram\_Frequency$(D, 2); ▹ Returns a dictionary which contains 2 letters (key) and their frequency (value), e.g., ['er':62, 'ra':23, ...]
3: $Unigram\_Frequency$ = sort($Unigram\_Frequency$, value);
4: $Bigram\_Frequency$ = sort($Bigram\_Frequency$, value);
5: for each $key \in Bigram\_Frequency$ do
6:     if $key[0] \notin CL$ then
7:         $CL$.add($key[0]$); ▹ first letter of key
8:     end if
9:     if $key[1] \notin CL$ then
10:         $CL$.add($key[1]$); ▹ second letter of key
11:     end if
12:     if $CL = L$ then
13:         break
14:     end if
15: end for
16: if $CL \neq L$ then
17:     $CL$.add($L \setminus CL$)
18: end if
19: $DL$ = $Unigram\_Frequency[0 : N * 6 - L.length)$
20: $DL$ = sort($DL$, $CL$)
21: for $(i = 0; i < N; i++)$ do
22:     for $(j = 0; j < 6; j++)$ do
23:         $Cubes[i][j]$ = $CL[j * 6 + i]$
24:     end for
25: end for
26: return $Cubes$

Algorithm 2 takes a training dataset, a number of cubes, and the alphabet of the language as input parameters. Lines 1–2 compute the Uni-gram and Bi-gram frequencies, based on Algorithm 1. The next 2 lines sort the Uni-gram and Bi-gram frequencies according to their values in descending order. The sequence of the letters is constructed, based on Bi-gram positions, in lines 5–15. The resulting sequence of letters includes the alphabet letters but in a different order. If some of the alphabet letters do not exist in the dataset, those letters are added to the resulting sequence of letters in lines 16–18. The sequence of letters may be shorter than the number of sides available on the cubes, so the remaining letters are selected based on Uni-gram frequencies (lines 19–20). Lines 21–25 place the sequence of letters into the cubes, and the last line returns the final set of cubes.
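The steps of the $LF$ approach can be sketched in Python as follows. This is an illustration under explicit assumptions rather than the authors' implementation: the names `lf_cubes` and `ngram_counts` are hypothetical; the duplicate letters are assumed to be appended after the Bi-gram-ordered alphabet; and, instead of the index arithmetic of the pseudocode, each letter is placed on the emptiest cube that does not already carry it, so that no cube repeats a letter. Words are assumed to use only letters of the given alphabet.

```python
from collections import Counter
from typing import List, Sequence


def ngram_counts(words: Sequence[Sequence[str]], n: int) -> Counter:
    """Frequencies of n consecutive symbols, keyed by tuples of symbols."""
    return Counter(tuple(w[i:i + n]) for w in words
                   for i in range(len(w) - n + 1))


def lf_cubes(alphabet: Sequence[str], words: Sequence[Sequence[str]],
             n_cubes: int) -> List[List[str]]:
    unigrams = ngram_counts(words, 1)
    bigrams = ngram_counts(words, 2)
    # Order the letters by the rank of the most frequent Bi-gram they occur in.
    cl: List[str] = []
    for pair, _ in bigrams.most_common():
        for letter in pair:
            if letter not in cl:
                cl.append(letter)
        if len(cl) == len(alphabet):
            break
    # Alphabet letters that never occur in the dataset go to the end.
    cl += [letter for letter in alphabet if letter not in cl]
    # Free sides are filled with the most frequent letters (Uni-gram).
    n_dup = max(n_cubes * 6 - len(alphabet), 0)
    dl = [key[0] for key, _ in unigrams.most_common(n_dup)]
    dl.sort(key=cl.index)
    cubes: List[List[str]] = [[] for _ in range(n_cubes)]
    for letter in cl + dl:  # assumption: duplicates follow the alphabet
        # Emptiest cube with a free side that does not already show the letter.
        target = min((c for c in cubes if len(c) < 6 and letter not in c),
                     key=len, default=None)
        if target is not None:  # letters that do not fit are skipped
            target.append(letter)
    return cubes
```

When the alphabet has more letters than there are sides (33 Russian letters against 30 sides for 5 cubes), the last letters of the sequence are simply not placed in this sketch.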

4.3.2. Vowel Priority ($VL$) Approach

In all four languages considered, vowel letters play the most important role in constructing 3–5 letter words, as demonstrated in Table 2. Taking that fact into account, we decided to design another model, called $VL$, which gives priority to the inclusion of vowel letters in the cubes. The selection process of vowel letters to be included in the cubes is shown in Algorithm 3.

Algorithm 3: The $Frequent\_Vowels$ method, returning a subset of vowels having frequencies higher than 5% of the dataset

Input: Letter frequency dictionary $LF$ (each item is a key-value pair)

Output: A list of vowels

Initialization: Empty list $Frequent\_Vowels$, integer number $Total$ set to 0

1: $Total = \sum_{i = 1}^{LF.length} LF[i].value$
2: for each $item \in LF$ do
3:     if $item.value / Total > 0.05$ then
4:         $Frequent\_Vowels$.add($item.key$)
5:     end if
6: end for
7: return $Frequent\_Vowels$

The input of Algorithm 3 is a sorted $LF$ ($Unigram\_Frequency$). Line 1 counts the total number of characters of a dataset from the given letter frequency. Lines 2–6 extract a subset of vowels, having occurrence frequency higher than 5%. The occurrence frequency is calculated by dividing the value of letter frequency by the total number of characters. Line 7 returns an extracted list of vowels as frequent vowels. The proposed model is defined in Algorithm 4.
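A minimal Python sketch of this selection is given below. The pseudocode takes only the letter frequency dictionary as input, so passing the vowel set explicitly (parameter `vowels`) is an assumption made here to restrict the result to vowels; the `threshold` parameter is likewise hypothetical and defaults to the 5% stated above.

```python
from typing import Dict, List, Set


def frequent_vowels(letter_freq: Dict[str, int], vowels: Set[str],
                    threshold: float = 0.05) -> List[str]:
    """Vowels whose relative frequency exceeds `threshold`, most frequent first."""
    total = sum(letter_freq.values())
    if total == 0:
        return []
    ranked = sorted(letter_freq.items(), key=lambda kv: kv[1], reverse=True)
    return [letter for letter, count in ranked
            if letter in vowels and count / total > threshold]
```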

Algorithm 4: Algorithm for designing letters for the cubic game based on the $VL$ approach

Input: A set of alphabet letters L, a list of vowels V, a list of consonants C, dataset D, number of cubes N

Output: Set of letters to form cubes corresponding to N

Initialization: Assign list of alphabet to A, number of cubes to N, empty list $DL$ to store the duplicate letters, empty list $CL$ to store the sequence of letters, empty 2D array $Cubes$ of size N×6

1: $Unigram\_Frequency$ = $CL\_ngram\_Frequency$(D, 1); ▹ Returns a dictionary which contains a letter (key) and its frequency (value), e.g., ['a':90, 'b':75, ...]
2: $Bigram\_Frequency$ = $CL\_ngram\_Frequency$(D, 2); ▹ Returns a dictionary which contains 2 letters (key) and their frequency (value), e.g., ['er':62, 'ra':23, ...]
3: $Unigram\_Frequency$ = sort($Unigram\_Frequency$, value);
4: $Bigram\_Frequency$ = sort($Bigram\_Frequency$, value);
5: for each $key \in Bigram\_Frequency$ do
6:     for $(i = 0; i \leq 1; i++)$ do
7:         if $key[i] \notin CL$ then
8:             $CL$.add($key[i]$); ▹ i-th letter of key
9:         end if
10:     end for
11:     if $CL = L$ then
12:         break
13:     end if
14: end for
15: $CL = CL + (L \setminus CL)$
16: $DL = (V + Frequent\_Vowels(Unigram\_Frequency) + C)[0 : N * 6 - L.length)$
17: $DL$ = sort($DL$, $CL$)
18: for $(i = 0; i < N; i++)$ do
19:     for $(j = 0; j < 6; j++)$ do
20:         $Cubes[i][j]$ = $CL[j * 6 + i]$
21:     end for
22: end for
23: return $Cubes$

Algorithm 4 has the alphabet letters, a list of vowel and consonant letters, a dataset, and the number of cubes as input parameters. The first 4 lines are the same as in the first model $LF$: they compute the Uni-gram and Bi-gram frequencies and sort them in descending order. Lines 5–14 build the positions of letters, based on the Bi-gram method, and put them into the $CL$ list. If some of the alphabet letters are not included in the resulting sequence generated by lines 5–14, those letters are added to the list in line 15. The resulting list should contain $N \times 6$ letters, so the sides remaining after the alphabet letters are filled from the list of vowels, followed by the frequent vowels described in Algorithm 3 and then the consonants (line 16). Line 17 sorts the $DL$ list, based on the $CL$ letter sequence. The letters are placed into cubes in lines 18–22 and the resulting set of cubes is returned in the last line.
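The step that distinguishes the $VL$ approach from $LF$ is the choice of duplicate letters in lines 16–17. The sketch below illustrates only that step; the function name `vl_duplicates` is hypothetical, and the lists of vowels, frequent vowels, and consonants are assumed to be supplied by the caller in the desired priority order.

```python
from typing import List, Sequence


def vl_duplicates(vowels: Sequence[str], frequent: Sequence[str],
                  consonants: Sequence[str], cl: Sequence[str],
                  n_cubes: int) -> List[str]:
    """Duplicate letters for the free sides, vowels first, ordered like `cl`."""
    n_dup = max(n_cubes * 6 - len(cl), 0)  # sides left after the alphabet
    pool = list(vowels) + list(frequent) + list(consonants)
    return sorted(pool[:n_dup], key=list(cl).index)
```

Under this reading, a 27-letter alphabet leaves 30 − 27 = 3 free sides with 5 cubes, all taken by vowels, and 48 − 27 = 21 free sides with 8 cubes, enough for all six Uzbek vowels, the three frequent vowels again, and twelve consonants.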

5. Experimental Results

The developed models ($LF$ and $VL$) were tested with four different numbers of cubes: 5, 6, 7, and 8. A 5-fold cross-validation evaluation method was utilized to perform the experiments. We evaluated our models on four datasets taken from the Uzbek, English, Russian, and Slovenian languages. Since our main goal was to apply the proposed model to the Uzbek language, we created a new dataset by involving language experts. The detailed information about datasets is presented in Table 3.
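A 5-fold evaluation loop can be sketched generically as follows. The protocol details are assumptions made for illustration: the cubes are built from four folds and scored on the held-out fold, the split is a seeded shuffle, and the spread is reported as a population standard deviation. The names `k_fold_scores`, `build`, and `score` are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def k_fold_scores(words: Sequence[str],
                  build: Callable[[List[str]], object],
                  score: Callable[[object, List[str]], float],
                  k: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of `score` over k train/test splits."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal parts
    scores = []
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        model = build(train)  # e.g. a set of cubes built from the training words
        scores.append(score(model, test))
    return mean(scores), pstdev(scores)
```

Here `build` would be a cube-construction function for a fixed number of cubes, and `score` a coverage function that returns the share of held-out words the cubes can form.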

The developed datasets mostly contain words from the “Noun”, “Verb” and “Adjective” word classes. The detailed distribution of words by word class is presented in Table 4.

Overall coverage (average values over the 5-fold cross-validation with standard deviations) of the $VL$ and $LF$ models in the case of 5, 6, 7, and 8 cubes is shown in Table 5, Table 6, Table 7 and Table 8. The best coverage for each dataset is shown in bold.

Table 5 illustrates the performance of the $LF$ and $VL$ models in the case of 5 cubes. It can be seen from the result that both models resulted in lower coverage in all datasets for 4–5 letter cases. The probability of constructing 4–5 letter words (especially, 5 letter words) with 5 cubes is really low. Both models obtained the same coverage on the Russian dataset with 48.6% in the 3 letter case, 18.2% in the 4 letter case, and 3.0% in the 5 letter case. The reason is that the cubic letters (30 letters for 5 cubes) did not include all the alphabet letters because the Russian language’s alphabet has 33 letters.

Table 6 presents the overall coverage in the case of 6 cubes. Overall coverage of both models was improved by around 30% in all datasets, compared to the case of 5 cubes. The $LF$ approach achieved higher accuracy than the $VL$ method in all cases with 3–5 letters. The same experiment for the case of 7 cubes is shown in Table 7. Interestingly, the standard deviation of the $LF$ method was slightly high (4.3%) on the Uzbek dataset, which meant that the results over 5-fold cross-validation fluctuated.

In the case of 7 cubes, the $LF$ method achieved reasonably high coverage (over 90%) in the English and Slovenian languages in 3–4 letter cases. Although the $VL$ model gained comparable results with the $LF$ model on the Uzbek and English datasets, it resulted in approximately 15% lower average coverage on the Russian and Slovenian datasets. The average coverage of both models increased around 17–20% for the case of 5 letter words in all datasets, but this result was still not what we desired, so we continued the experiment with 8 cubes, as illustrated in Table 8.

The results in Table 8 show that the $VL$ model achieved better coverage than the $LF$ model on all datasets, except the Russian. More precisely, the improvement of the $VL$ model was 7% for the Uzbek, 4.7% for the English, and 2.1% for the Slovenian languages compared to the $LF$ model. The main reason is that the $VL$ model includes more vowel letters in the case of 8 cubes, compared to the cases of 5, 6, or 7 cubes. In general, all models achieved higher results in all datasets in the case of 8 cubes, in particular over 98% in the 3 letter case and over 95% in the 4 letter case for the Uzbek, Slovenian, and English datasets. This was an expected behavior because the chance of constructing 3–5 letter words from 8 cubes (containing 48 letters) became really high. In all experiments, the $VL$ method resulted in a lower result than the $LF$ approach on the Russian dataset. As the number of cubes increased, the proposed methods tended to achieve reasonably high coverage on all datasets. We achieved the intended coverage with 8 cubes, and, therefore, this version was the last case of the experiment. The average coverages, based on 5–8 cubes obtained by the $LF$ and $VL$ models, are presented in Figure 5.

Figure 5 illustrates that both models achieved comparable results in terms of average coverage under 5–8 cube conditions. The $LF$ method gained slightly better results than the $VL$ method on 3–5 letter average coverage for the Slovenian and Russian datasets, while this result was comparable in the Uzbek and English datasets.

To check the time complexity of the proposed models, we recorded execution time while performing the experiment, shown in Table 9. The experiments were performed on a computer with an Intel Core i5 processor and 8 GB of RAM.

It can be seen from Table 9 that average execution times of both models (over 5-fold) ranged between 0.04–0.06 seconds in all datasets.

6. Discussion of Results

The experimental results showed that both the $LF$ and $VL$ models achieved reasonable dataset coverage across various languages. The coverage of both models significantly improved with increase in the number of cubes, which was an expected behaviour. The proposed models achieved the intended coverage with 8 cubes, which was selected as an optimal number of cubes. Once we generated the cubic letters by the $LF$ and $VL$ models, we further wondered if it was possible to improve the overall coverage by considering all the combinations of swapping the letters between any pairs of cubes (found by the $LF$ and $VL$ models).

$LF_{Opt}$ utilized a simple stochastic method, based on the letters found by $LF$, and $VL_{Opt}$ used the same method for letters generated by the $VL$ model.
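The paper does not spell out the stochastic method, so the following is only one plausible form of it: a random hill climb that swaps two letters between two randomly chosen cubes and keeps the swap only when the number of formable words increases. All names (`swap_search`, `covered`, `can_form`) and parameters (`iterations`, `seed`) are hypothetical, and swaps that would repeat a letter on a cube are skipped.

```python
import random
from typing import List, Sequence


def can_form(word: Sequence[str], cubes: List[List[str]]) -> bool:
    """True if each letter of `word` can be shown by a different cube."""
    if len(word) > len(cubes):
        return False
    owner = [-1] * len(cubes)

    def assign(pos: int, seen: List[bool]) -> bool:
        for c, cube in enumerate(cubes):
            if word[pos] in cube and not seen[c]:
                seen[c] = True
                if owner[c] == -1 or assign(owner[c], seen):
                    owner[c] = pos
                    return True
        return False

    return all(assign(p, [False] * len(cubes)) for p in range(len(word)))


def covered(words: Sequence[Sequence[str]], cubes: List[List[str]]) -> int:
    """Number of words that can be formed from the cubes."""
    return sum(can_form(w, cubes) for w in words)


def swap_search(cubes: List[List[str]], words: Sequence[Sequence[str]],
                iterations: int = 200, seed: int = 0) -> List[List[str]]:
    """Randomly swap letters between cubes, keeping only improving swaps."""
    rng = random.Random(seed)
    best = [list(c) for c in cubes]  # work on a copy of the input
    best_score = covered(words, best)
    for _ in range(iterations):
        a, b = rng.sample(range(len(best)), 2)
        i, j = rng.randrange(len(best[a])), rng.randrange(len(best[b]))
        if best[b][j] in best[a] or best[a][i] in best[b]:
            continue  # the swap would repeat a letter on one cube
        best[a][i], best[b][j] = best[b][j], best[a][i]
        score = covered(words, best)
        if score > best_score:
            best_score = score
        else:
            best[a][i], best[b][j] = best[b][j], best[a][i]  # undo the swap
    return best
```

Every iteration re-evaluates the whole word list, which illustrates why such a search is far slower than building the cubes once from n-gram frequencies.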

$LF_{Opt}$ achieved 94.7% overall coverage, which was 5.8% higher than the $LF$ method on the Uzbek dataset, 94.4% (2.3% better) on the English dataset, 88.0% (3.9% better) on the Russian dataset and 94.2% (2.1% better) on the Slovenian dataset.

The $VL_{Opt}$ method attained overall coverage of 96.3% and 97.5% on the Uzbek and English datasets, which represented similar results to the $VL$ method. Similarly, the $VL_{Opt}$ method achieved 75.0% coverage on the Russian dataset, which was 3.1% better than the $VL$ method. Finally, on the Slovenian dataset, the $VL_{Opt}$ method yielded a coverage of 94.8%, indicating a 0.6% improvement compared to the $VL$ method.

Although the optimization technique improved the coverage slightly, its time complexity was significantly worse than that of the proposed models ($LF$ and $VL$). This means that the optimization method is significantly more time-consuming and may not be practical for large datasets.

7. Conclusions and Future Work

A matching-letter game is an essential tool for a child to improve letter recognition and vocabulary, as well as orthography. In this paper, we proposed two models, based on Letter Frequency ($LF$) and Vowel Priority ($VL$) methods, for modeling a cubic-oriented word game. Experimental evaluations showed that both models had their own advantages, depending on the number of cubes. In the case of 8 cubes, the $VL$ model achieved higher overall coverage (over 94%) than the $LF$ approach (over 89%) on Uzbek, English and Slovenian datasets, because the datasets of those languages have less frequent vowels. Both models covered around 99% of 3-letter words in the Uzbek, English and Slovenian datasets, while this coverage was over 85% in 5-letter words. Both models can be applied to other languages by providing their alphabets and datasets consisting of 3–5 letter words.

The results obtained in this research suggest that, while our models can provide a good starting point for designing word games, there is still room for further optimization using stochastic methods. Future research can explore the use of more advanced stochastic optimization methods to improve the overall coverage, while minimizing the number of cubes used.

Author Contributions

Conceptualization, J.M., U.S. and B.K.; methodology, J.M. and U.S.; software, U.S.; validation, J.M. and U.S.; formal analysis, J.M., U.S. and B.K.; writing—original draft preparation, J.M. and U.S.; writing—review and editing, J.M. and B.K. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The dataset used in this paper can be found on https://github.com/UlugbekSalaev/Cubic-Word-Game-Modeling/tree/master/Dataset (accessed on 1 October 2022).

Acknowledgments

The first and third authors acknowledge the Slovenian Research Agency ARRS for funding the project J2-2504. They also gratefully acknowledge the European Commission for funding the InnoRenewCoE project (Grant Agreement #739574) under the Horizon2020 Widespread-Teaming program and the Republic of Slovenia (Investment funding of the Republic of Slovenia and the European Union of the European Regional Development Fund). The first author also would like to sincerely thank the Ministry of “Innovative Development” of the Republic of Uzbekistan for funding this research.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. The examples of Uzbek letter-match word game: (a) 3-letter word; (b) 4-letter word; (c) 5-letter word.

Figure 2. The examples of English letter-match word game: (a) 3-letter word; (b) 4-letter word; (c) 5-letter word.

Figure 3. The examples of Russian letter-match word game: (a) 3-letter word; (b) 4-letter word; (c) 5-letter word.

Figure 4. Overview of the proposed model.

Figure 5. Comparison of the $L F$ and $V L$ models in average coverage in the case of 5–8 cubes.

Table 1

Letter frequencies (%) in each dataset, listed in descending order; vowels whose frequency exceeded 5% of the overall frequency of the dataset were highlighted in bold.

UZ		EN		RU		SL
a	14.06	e	10.41	а	10.77	a	11.17
i	8.27	a	9.21	o	8.99	e	9.28
o	8.23	s	8.51	к	6.20	o	8.54
r	5.99	o	7.36	р	6.01	r	6.79
l	5.15	r	6.01	т	5.90	i	6.56
s	4.93	l	5.44	е	5.24	t	5.34
t	4.91	t	5.41	л	4.98	n	5.31
u	4.78	i	5.39	и	4.83	k	5.26
n	4.35	d	4.60	с	4.58	l	4.66
m	3.70	n	4.29	н	4.48	s	4.04
q	3.58	c	4.05	у	3.62	p	3.55
k	3.39	u	3.68	м	3.11	v	3.33
y	3.12	b	3.59	п	3.07	d	3.25
h	3.02	p	3.21	в	3.02	u	2.94
b	2.94	m	2.92	д	2.97	m	2.86
e	2.63	h	2.70	б	2.71	b	2.57
d	2.50	g	2.47	ь	2.43	j	2.36
z	2.50	f	2.24	г	2.02	c	2.29
v	2.13	y	2.23	з	1.90	g	1.89
ō	1.94	k	2.03	я	1.70	z	1.86
p	1.43	w	1.73	ч	1.41	š	1.67
f	1.39	v	1.08	ы	1.37	ž	1.36
g	1.26	x	0.52	ш	1.31	h	1.20
j	1.19	z	0.42	й	1.31	č	1.12
ḡ	1.11	j	0.42	х	1.26	f	0.80
x	0.96	q	0.10	ж	1.18
c	0.55			ё	0.96
				ф	0.87
				ц	0.56
				ю	0.54
				щ	0.34
				э	0.31
				ъ	0.04
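As an illustration of how percentages like those in Table 1 can be obtained, here is a minimal sketch that counts character unigrams over a word list. It assumes the frequency of a letter is its share of all letter occurrences in the dataset; the function name `letter_frequencies` and the three sample words are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def letter_frequencies(words: Iterable[str]) -> Dict[str, float]:
    """Return each letter's share (%) of all letter occurrences, most frequent first."""
    counts = Counter(ch for word in words for ch in word.lower())
    total = sum(counts.values())
    return {ch: round(100.0 * n / total, 2) for ch, n in counts.most_common()}


# Hypothetical sample words, not the paper's dataset.
sample_words = ["ona", "ota", "bola"]
freqs = letter_frequencies(sample_words)
for letter, share in freqs.items():
    print(letter, share)
```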

Table 2

A common pattern of vowel and consonant letter positional occurrences and their portions within 3–5 letter words for each dataset (*—vowel, #—consonant).

Dataset	5 Letter		4 Letter		3 Letter
UZ	###	49%	#*##	30%	#*#	80%
	###	22%	##	27%	*##	10%
	###	15%	##	22%	#	9%
	##*	3%	##	11%
	##*##	2%	##*#	7%
	###	2%
	###	2%
EN	###	21%	#*##	37%	#*#	59%
	##*##	13%	##	22%	#**	13%
	#*###	11%	#**#	16%	*##	9%
	###	11%	##*#	12%	#	7%
	#**##	8%	##	4%	**#	5%
	###	6%	##	2%	##*	4%
	##**#	6%	##**	2%
	###	5%
RU	###	36%	##	40%	#*#	68%
	###	24%	##*#	22%	#	12%
	###	13%	#*##	15%	##*	6%
	###	7%	##	11%	#**	4%
	##*	5%	##	5%	#**	4%
	##*	4%	#**#	4%	*##	4%
	##*##	3%
SL	###	46%	##	37%	#*#	77%
	###	19%	##*#	27%	###	7%
	###	10%	#*##	12%	#	6%
	###	7%	##	12%	##*	4%
	###*#	5%	##	4%	*##	3%
	##*	3%	###*	3%	#**	2%
	##*##	2%	#**#	2%
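The positional patterns in Table 2 use `*` for a vowel and `#` for a consonant. The sketch below shows one way such patterns and their shares could be derived from a word list. The vowel set, the function names `cv_pattern` and `pattern_shares`, and the sample words are assumptions for illustration; each language would need its own vowel inventory.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Assumed vowel set for English; other languages need their own set.
EN_VOWELS: Set[str] = set("aeiou")


def cv_pattern(word: str, vowels: Set[str] = EN_VOWELS) -> str:
    """Map a word to its vowel/consonant pattern ('*' = vowel, '#' = consonant)."""
    return "".join("*" if ch in vowels else "#" for ch in word.lower())


def pattern_shares(words: Iterable[str], vowels: Set[str] = EN_VOWELS) -> Dict[str, float]:
    """Return the share (%) of each pattern among `words`, most common first."""
    counts = Counter(cv_pattern(w, vowels) for w in words)
    total = sum(counts.values())
    return {pattern: 100.0 * n / total for pattern, n in counts.most_common()}


# Hypothetical sample of 3-letter words.
shares = pattern_shares(["cat", "dog", "sea", "egg"])
print(shares)
```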

Table 3

Description of datasets.

Dataset	Number of Words	3-Letter Words	4-Letter Words	5-Letter Words
UZ	4558	518	1165	2875
EN	6024	1026	2499	2499
RU	4308	516	1285	2507
SL	6280	515	1598	4167

Table 4

Distribution of words by different word types for the datasets.

Word Classes	Uzbek	English	Russian	Slovenian
Adjective	663	157	142	1903
Adverb	106	82	264	30
Noun	3117	5496	3438	4075
Verb	566	219	139	223
Numeral	13	9	17	25
Determiner	16	10	-	13
Pronoun	13	11	14	6
Other	64	40	294	5

Table 5

Overall coverage (%) with a standard deviation of both methods in the case of 5 cubes.

Dataset	Method	Total	3 Letter	4 Letter	5 Letter
UZ	$L F$	$25.9 \pm 1.6$	$67.7 \pm 4.7$	$40.1 \pm 3.0$	$12.6 \pm 0.4$
UZ	$V L$	$26.1 \pm 1.6$	$68.5 \pm 4.4$	$40.3 \pm 3.9$	$12.7 \pm 0.4$
EN	$L F$	$34.8 \pm 1.8$	$68.7 \pm 3.5$	$41.5 \pm 2.8$	$14.0 \pm 1.5$
EN	$V L$	$33.0 \pm 1.4$	$69.8 \pm 3.3$	$38.0 \pm 2.2$	$12.8 \pm 1.6$
RU	$L F$ , $V L$	$13.0 \pm 1.3$	$48.6 \pm 3.9$	$18.2 \pm 2.7$	$3.0 \pm 0.6$
SL	$L F$	$29.1 \pm 1.9$	$78.1 \pm 4.4$	$45.7 \pm 1.5$	$16.8 \pm 1.9$
SL	$V L$	$26.8 \pm 1.5$	$74.4 \pm 3.6$	$45.2 \pm 4.3$	$13.8 \pm 1.0$

Table 6

Overall coverage (%) with standard deviation of both methods in the case of 6 cubes.

Dataset	Method	Total	3 Letter	4 Letter	5 Letter
UZ	$L F$	$58.6 \pm 4.3$	$88.4 \pm 4.3$	$73.6 \pm 6.2$	$47.2 \pm 3.7$
UZ	$V L$	$53.3 \pm 1.6$	$84.8 \pm 2.9$	$67.9 \pm 3.0$	$41.7 \pm 1.5$
EN	$L F$	$68.6 \pm 2.0$	$88.8 \pm 1.5$	$73.8 \pm 1.8$	$55.1 \pm 2.7$
EN	$V L$	$64.6 \pm 1.0$	$85.6 \pm 2.6$	$70.0 \pm 1.9$	$50.5 \pm 1.8$
RU	$L F$	$36.5 \pm 1.3$	$71.3 \pm 1.9$	$49.6 \pm 3.3$	$23.1 \pm 1.3$
RU	$V L$	$36.1 \pm 1.8$	$70.2 \pm 6.5$	$49.4 \pm 1.7$	$22.9 \pm 2.9$
SL	$L F$	$67.3 \pm 2.4$	$93.3 \pm 2.4$	$81.5 \pm 1.0$	$58.7 \pm 2.9$
SL	$V L$	$58.6 \pm 2.8$	$88.4 \pm 3.5$	$70.3 \pm 3.5$	$50.4 \pm 2.9$

Table 7

Overall coverage (%) with a standard deviation of both methods in the case of 7 cubes.

Dataset	Method	Total	3 Letter	4 Letter	5 Letter
UZ	$L F$	$77.2 \pm 1.1$	$95.7 \pm 2.5$	$87.2 \pm 2.5$	$69.9 \pm 1.3$
UZ	$V L$	$77.4 \pm 1.6$	$95.4 \pm 2.1$	$87.9 \pm 2.0$	$70.0 \pm 2.1$
EN	$L F$	$87.4 \pm 2.8$	$96.7 \pm 1.2$	$91.1 \pm 2.2$	$79.9 \pm 5.3$
EN	$V L$	$86.6 \pm 0.7$	$94.6 \pm 1.7$	$89.8 \pm 1.3$	$80.0 \pm 1.1$
RU	$L F$	$66.2 \pm 1.5$	$89.2 \pm 1.5$	$78.6 \pm 3.6$	$55.0 \pm 1.5$
RU	$V L$	$51.5 \pm 0.8$	$83.6 \pm 2.3$	$64.6 \pm 2.3$	$38.3 \pm 1.1$
SL	$L F$	$87.7 \pm 1.2$	$97.9 \pm 1.1$	$93.6 \pm 1.3$	$84.2 \pm 1.3$
SL	$V L$	$73.7 \pm 1.7$	$94.4 \pm 2.2$	$83.4 \pm 2.3$	$67.4 \pm 1.8$

Table 8

Overall coverage (%) with a standard deviation of both methods in the case of 8 cubes.

Dataset	Method	Total	3 Letter	4 Letter	5 Letter
UZ	$L F$	$88.9 \pm 2.1$	$99.4 \pm 0.8$	$95.1 \pm 1.8$	$84.5 \pm 2.9$
UZ	$V L$	$95.9 \pm 1.1$	$99.4 \pm 0.7$	$98.1 \pm 0.4$	$94.4 \pm 1.6$
EN	$L F$	$92.1 \pm 1.9$	$98.7 \pm 0.5$	$95.1 \pm 1.8$	$86.8 \pm 3.4$
EN	$V L$	$96.8 \pm 0.4$	$99.1 \pm 0.5$	$97.8 \pm 0.3$	$94.7 \pm 1.2$
RU	$L F$	$84.1 \pm 1.6$	$95.9 \pm 1.4$	$89.8 \pm 1.9$	$78.9 \pm 1.9$
RU	$V L$	$71.9 \pm 1.5$	$89.5 \pm 4.8$	$80.5 \pm 3.2$	$64.0 \pm 2.4$
SL	$L F$	$92.1 \pm 0.9$	$99.6 \pm 0.5$	$97.0 \pm 0.8$	$89.3 \pm 1.3$
SL	$V L$	$94.2 \pm 0.5$	$99.4 \pm 0.5$	$97.2 \pm 0.7$	$92.4 \pm 0.6$

Table 9

Overall execution times in seconds (over 5-fold) of the proposed models.

Dataset	Method	5 Cubes	6 Cubes	7 Cubes	8 Cubes
UZ	$L F$	0.0535	0.0469	0.0469	0.0353
UZ	$V L$	0.0472	0.0535	0.0530	0.0647
EN	$L F$	0.0625	0.0752	0.0630	0.0469
EN	$V L$	0.0535	0.0470	0.0595	0.0637
RU	$L F$	0.0687	0.0474	0.0535	0.0533
RU	$V L$	0.0469	0.0593	0.0470	0.0470
SL	$L F$	0.0690	0.0625	0.0667	0.0848
SL	$V L$	0.0625	0.0556	0.0677	0.0486

References

1. Viera, R.T. Vocabulary knowledge in the production of written texts: A case study on EFL language learners. Rev. Tecnol. ESPOL (RTE); 2017; 30, pp. 89-105.

2. Alqahtani, M. The importance of vocabulary in language learning and how to be taught. Int. J. Teach. Educ.; 2015; 3, pp. 21-34. [DOI: https://dx.doi.org/10.20472/TE.2015.3.3.002]

3. Azar, A.S. The Effect of Games on EFL Learners’ Vocabulary Learning Strategies. Int. J. Basic Appl. Sci.; 2012; 1, pp. 252-256. [DOI: https://dx.doi.org/10.17142/ijbas-2012.1.2.10]

4. Rohani, M.; Pourgharib, B. The Effect of Games on Learning Vocabulary. Int. J. Basic Appl. Sci.; 2013; 4, pp. 3540-3543.

5. Alavi, G.; Gilakjani, A.P. The Effectiveness of Games in Enhancing Vocabulary Learning among Iranian Third Grade High School Students. Malays. J. Elt Res.; 2019; 16, Available online: https://scholar.google.com.sg/scholar?hl=zh-TW&as_sdt=0%2C5&q=The+Effectiveness+of+Games+in+Enhancing+Vocabulary+Learning+among+Iranian+Third+Grade+High+School+Students.&btnG= (accessed on 1 October 2022).

6. Mageda, N.; Amaal, M. The Effect of Using Word Games on Primary Stage Students Achievement in English Language Vocabulary in Jordan. Am. Int. J. Contemp. Res.; 2014; 4, pp. 144-152.

7. Shchukina, T.J.; Mardieva, L.A.; Alyokine, T.A. Teaching Russian Language: The Role of Word Formation. Teacher Education-IFTE 2016, Volume 12. European Proceedings of Social and Behavioural Sciences; Valeeva, R. Future Academy: Kazan, Russia, 2016; pp. 190-196. [DOI: https://dx.doi.org/10.15405/epsbs.2016.07.31]

8. Whitney, C. How the brain encodes the order of letters in a printed word: The SERIOL model and selective literature review. Psychon. Bull. Rev.; 2001; 8, pp. 221-243. [DOI: https://dx.doi.org/10.3758/BF03196158] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/11495111]

9. Aristides, V.; Monica, G.; Maria, F.; Christos, T. Utilizing NLP Tools for the Creation of School Educational Games. Educating Engineers for Future Industrial Revolutions, Proceedings of the ICL 2020. Advances in Intelligent Systems and Computing, Tallinn, Estonia, 23–25 September 2020; Auer, M.E.; Rüütmann, T. Springer: Tallinn, Estonia, 2020; Volume 1328, [DOI: https://dx.doi.org/10.1007/978-3-030-68198-2_62]

10. Hojiev, A. O’zbek Tili Morfologiyasi, Morfemikasi va so’z Yasalishining Nazariy Masalalari; O’zbekiston Respublikasi Fanlar Akademiyasi Fan Nashriyoti: Tashkent, Uzbekistan, 2010.

11. Salaev, U.; Kuriyozov, E.; Gómez-Rodríguez, C. A machine transliteration tool between Uzbek alphabets. Proceedings of the International Conference on Agglutinative Language Technologies as a Challenge of Natural Language Processing; Primorska, Slovenia, 7 June 2022.

12. Salaev, U. Computational Methods Using Character Statistics for the Word Game. Proceedings of the Computer Linguistics: Problems, Solutions, Prospects; Tashkent, Republic of Uzbekistan, 19–22 September 2006; Volume 1, pp. 238-243.

13. Zaitun, M.; Fitri, A.J. E. Big Cube Game: An Instructional Medium Used in Students’ Vocabulary Mastery. J. Engl. Lit. Educ.; 2020; 7, pp. 101-106.

14. Vu, N.N.; Linh, P.T.M.; Lien, N.T.H.; Van, N.T.T. Using Word Games to Improve Vocabulary Retention in Middle School EFL Classes. Proceedings of the 18th International Conference of the Asia Association of Computer-Assisted Language Learning (AsiaCALL–2-2021), Advances in Social Science, Volume 621, Education and Humanities Research; Ho Chi Minh, Vietnam, 26–27 November 2021.

15. Anugerah, R.; Wijaya, B.; Bunau, E. The Use Of Build-A-Sentence Cubes Game In Teaching Simple Past Tense. J. Pendidik. Dan Pembelajaran Khatulistiwa (JPPK); 2016; 5, [DOI: https://dx.doi.org/10.26418/jppk.v5i3.14526]

16. Madvaliyev, A.; Begmatov, E. O’zbek Tilining Imlo Lug’ati; Mahmudov, N. Akadem-nashr: Tashkent, Uzbekistan, 19 May 2012.

17. OpenCorpora: An Open Source Initiative for Building a Free and Comprehensive Corpora for Russian and Other Slavic Languages. Available online: http://opencorpora.org/ (accessed on 15 October 2022).

18. Kaja, D.; Simon, K.; Peter, H.; Tomaž, E.; Miro, R.; Špela, A.H.; Jaka, Č.; Luka, K.; Marko, R.-Š. Morphological Lexicon Sloleks 2.0, Slovenian Language Resource Repository CLARIN.SI, ISSN 2820-4042. 2019; Available online: http://hdl.handle.net/11356/1230 (accessed on 10 October 2022).

19. Jurafsky, D.; Martin, J.H. N-gram Language Models. Speech and Language Processing, Third Edition Draft; Stanford University: Stanford, CA, USA, 2022; pp. 30-57.

20. Rice, G.A.; Robinson, D.O. The role of bigram frequency in the perception of words and nonwords. Mem. Cogn.; 1975; 3, pp. 513-518. Available online: https://link.springer.com/content/pdf/10.3758/BF03197523.pdf (accessed on 1 November 2022). [DOI: https://dx.doi.org/10.3758/BF03197523] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24203873]

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Word games are one of the most essential factors of vocabulary learning and matching letters to form words for children aged 5–12. These games help children to improve letter and word recognition, memory-building, and vocabulary retention skills. Since Uzbek is a low-resource language, there has not been enough research into designing word games for the Uzbek language. In this paper, we develop two models for designing the cubic-letter game, also known as the matching-letter game, in the Uzbek language, consisting of a predefined number of cubes, with a letter on each side of each six-sided cube, and word cards to form words using a combination of the cubes. More precisely, we provide the opportunity to form as many words as possible from the dataset, while minimizing the number of cubes. The proposed methods were created using a combination of a character-level n-gram model and letter position frequency in words at the level of vowels and consonants. To perform the experiments, a novel dataset, consisting of 4.5 k 3–5 letter words, was created by filtering based on child age groups for the Uzbek language, and three more datasets were generated, based on the support of experts for the Russian, English, and Slovenian languages. Experimental evaluations showed that both models achieved good results in terms of average coverage. In particular, the Vowel Priority (VL) approach obtained reasonably high coverage with 95.9% in Uzbek, 96.8% in English, and 94.2% in the Slovenian language in the case of eight cubes, based on the five-fold cross-validation method. Both models covered around 85% of five letter words in Uzbek, English, and Slovenian datasets, while this coverage was even higher (99%) in three letter words in the case of eight cubes.

Details

Title

Word Game Modeling Using Character-Level N-Gram and Statistics

Author

Mattiev, Jamolbek¹; Salaev, Ulugbek¹; Kavsek, Branko²

¹ Information Technologies Department, Urgench State University, Khamid Alimdjan 14, Urgench 220100, Uzbekistan
² Department of Information Sciences and Technologies, University of Primorska, Glagoljaška 8, 6000 Koper, Slovenia; AI Laboratory, Jožef Stefan Institute, Jamova Cesta 39, 1000 Ljubljana, Slovenia

First page

1380

Publication year

2023

Publication date

2023

Publisher

MDPI AG

e-ISSN

2227-7390

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/math11061380

ProQuest document ID

2791672208
