Abstract
Despite being widely spoken and extensively studied by language and cognitive scientists, Italian lacks large-scale language processing resources. The Italian Crowdsourcing Project (ICP) is a dataset of word recognition times and accuracy comprising responses to 130,495 words, which makes it the largest dataset of its kind in terms of items. The data were collected in an online word knowledge task in which over 156,000 native speakers of Italian took part. We validated the ICP dataset by (1) showing that ICP reaction times correlate strongly (r = .78) with lexical decision latencies collected in a traditional lab experiment, (2) showing that the effects of major psycholinguistic variables (e.g., frequency, length) can be replicated in this dataset, and (3) replicating the effect of word prevalence, which we compute here for the first time for Italian. Given the inclusion of many inflectional forms of verbs, adjectives, and nouns, we further showcase the potential of this dataset by exploring two phenomena (inflectional entropy in verb paradigms and the clitic effect in isolated word recognition) that build on the peculiar properties of Italian. In this paper we present the ICP resource and release response times, accuracy, and prevalence estimates for all the words included.
Introduction
Over the years, research on visual word recognition has increasingly relied on large sets of data because they are better suited for explaining the effects of multiple word features simultaneously, for examining the influence of novel variables, and for testing the predictions of cognitive models of visual word processing (Balota et al., 2012; Mandera et al., 2020; Keuleers & Balota, 2015).
The resources needed for large-scale analyses are megastudies: large-scale experiments in which hundreds of participants respond to hundreds or thousands of stimuli. The most obvious advantage of megastudies is their statistical power, but this is not their only appeal. Another advantage is that megastudies are designed around a task or paradigm (e.g., lexical decision, word naming, priming), rather than around a specific type of stimuli or effects. This means that participants do not respond to a limited selection of items, and thus there is less risk of item selection bias on the researcher's part (see Amenta et al., 2017). The large number of stimuli further allows for continuous analysis of variables (e.g., frequency), so that effects of word features can be assessed across their entire range (Balota et al., 2012; Keuleers & Balota, 2015). Probably the most relevant advantage of megastudies is that the same dataset can be used multiple times to test different hypotheses or even to develop new ones. This is particularly relevant for testing the predictions of competing cognitive models (Balota et al., 2012).
The megastudy approach first emerged in visual word processing and was popularized by the English Lexicon Project (ELP; Balota et al., 2007), which collected lexical decision and naming data for 40,418 English words from a total of 1260 participants. The ELP started a new research tradition based on the analysis of pre-collected behavioral resources. To date, the ELP has been cited and used more than 2500 times to assess the influence of a variety of word features and metrics. It was soon followed by several megastudies in other languages using a similar procedure. The French Lexicon Project (Ferrand et al., 2010) included lexical decision latencies for 38,840 words from 975 participants, while the Dutch Lexicon Projects 1 and 2 (Brysbaert et al., 2016; Keuleers et al., 2010) included cumulative lexical decision response times for 44,000 words from 120 participants. The Malay Lexicon Project (Maziyah Mohamed et al., 2023; Yap et al., 2010) included behavioral data (lexical decision latencies) and lexical features for 9592 words. The British Lexicon Project (Keuleers et al., 2012) included lexical decision times for 28,730 words from 78 participants. Finally, the Chinese Lexicon Project (Sze et al., 2014) contained lexical decision responses for 2500 characters and was followed by two additional lexicon projects, one with lexical decision responses for over 25,000 two-character words (Tse et al., 2017) and one with naming latencies for the same material (Tse et al., 2022). These resources contributed to rapid advances in word processing research because they not only allowed researchers to test the effects of different variables and metrics on large amounts of data (see Mandera et al., 2020, for a brief overview), but also made it possible to validate the influence of variables cross-linguistically.
One limitation of these megastudies is that the data were still collected in a physical laboratory, at a relatively high cost in terms of time and resources. Moreover, lexicon projects inherited another limitation of laboratory-based psychological experiments: they almost all tested university students, which limits generalization to other groups (Henrich et al., 2010). An obvious solution to these limitations was to investigate word processing outside the context of university laboratories, using online crowdsourcing.
Crowdsourcing, as the name suggests, consists of sharing a particular task or question with a large community of individuals and asking for their cooperation. It is widely used in engineering and computer science, but only fairly recently has it found application in the language and psychological sciences (Hartshorne et al., 2019; Keuleers & Balota, 2015). In this context, crowdsourcing consists of collecting large samples of behavioral data (e.g., reaction times) or human intuitions (e.g., concreteness ratings), usually relying on social media or on web platforms such as Amazon Mechanical Turk or Prolific, where individuals register and offer their time to perform a series of tasks or experiments. Crowdsourcing offers the advantage of enrolling participants with different backgrounds and demographic characteristics (in terms of, for example, age, geographic origin, and education), all variables known to influence language processing. This has allowed researchers to test individual differences in language processing on the one hand and to increase the generalizability of experimental results on the other (e.g., Kyröläinen et al., 2021).
The Dutch Crowdsourcing Project (Brysbaert et al., 2019a) collected visual word recognition data for over 54,400 words based on 410,000 valid sessions,1 resulting in about 26 million responses. In contrast to the earlier lexicon projects, which employed a lexical decision task, the crowdsourcing project assessed word knowledge: participants were asked whether or not they knew the word they were reading. One third of the stimuli were pseudowords, to prevent participants from responding “yes” to all stimuli. There were no response time constraints, so participants could take the time they needed to respond, as the authors did not expect response speed to be of interest in this task. Subsequently, however, they discovered that the response times in the Dutch Crowdsourcing Project correlated well (around .7) with those of the Dutch megastudies collected previously. Other studies have confirmed that response data from yes/no decision tasks correlate well with vocabulary tests asking participants to associate a meaning with a given word (e.g., Stubbe, 2012; Yap et al., 2012).
The Dutch Crowdsourcing Project was followed by the even larger English Crowdsourcing Project (Brysbaert et al., 2019b; Mandera et al., 2020) and the Spanish Crowdsourcing Project (Aguasvivas et al., 2018, 2020). The English study involved about 220,000 participants (considering only native speakers) and provided recognition time and accuracy for about 62,000 words, while the Spanish study involved about 150,000 participants, all native speakers from about 20 Spanish-speaking countries, and provided recognition time and accuracy for about 45,000 Spanish words.
Datasets of this size allow researchers to study language processing while taking into account specific phenomena related to word characteristics or participant differences, which cannot be studied with smaller datasets and homogeneous samples. One major outcome was the quantification of word prevalence. Given the huge number of participants and evaluations per word (about 380 in the English study and 330 in the Spanish study), the authors were able to estimate how many people know a particular word, a variable called word prevalence. Keuleers et al. (2015) proposed prevalence as a measure of word knowledge that complements word frequency: while word frequency is a count measure based on text corpora, prevalence captures information about word knowledge directly from speakers and reflects the probability that a word is known. Prevalence has been shown to be a particularly effective predictor of reaction times in visual word recognition, especially for low-frequency words, where word frequency tends to be uninformative (Brysbaert et al., 2016, 2019b; Keuleers et al., 2015; Mandera et al., 2020). In a large text corpus (in the millions-of-words range), most words have a low frequency (less than one occurrence per million words; Brysbaert et al., 2018), while a few words have an extremely high frequency (over 100 occurrences per million words). Yet many low-frequency words are quite common and have a high probability of being known by speakers. This is where prevalence is most effective: in gauging which words are actually known by speakers even when they have low corpus frequency.
Moreover, datasets of this size make it possible not only to examine effects specific to word characteristics, but also to examine effects associated with participant differences. While participants in laboratory studies tend to be homogeneous in terms of age, education, and geographic location (most are university students), crowdsourcing makes it possible to collect data from a variety of individuals with different language experiences. For example, the Dutch studies (Brysbaert et al., 2016; Keuleers et al., 2015) examined the impact of age, education, and location (respondents were Dutch speakers in Belgium or the Netherlands) on vocabulary size and showed that all three had a significant impact on word knowledge. Similar results came from the Spanish study (Aguasvivas et al., 2020) that compared word knowledge among Spanish speakers living in different Spanish-speaking countries.
The purpose of the current study is to provide crowdsourcing data for the Italian language and to demonstrate its theoretical and empirical relevance to psycholinguistic research in Italian. Despite the fact that the Italian language is spoken by about 68 million people (Ethnologue, 2023) and has been studied in many psycholinguistic and linguistic studies, a general-purpose resource for the language does not exist, as there is no Italian lexical megastudy. Unlike other crowdsourcing projects, we decided to focus on a large variety of stimuli, collecting data for more than 130,000 words, including many inflectional forms of verbs, adjectives, and nouns. We believe that, given the inflectional richness of the Italian language, a database with a large number of inflected forms will be particularly useful for scientists interested in morphological processing.
In the following, we present the collected data and validate them against human responses collected in a laboratory experiment. To give an idea of the potential applications of the proposed resource, we present two case studies. First, we show how the collected data can be used to extend previously investigated phenomena, namely the inflectional entropy effect. In the second case study, we present a possible new effect for which Italian is ideally suited as a testing environment, namely the “clitic2 effect” in isolated word processing.
The Italian Crowdsourcing Project
Methods
The Italian Crowdsourcing Project (ICP) was developed by Ghent University (Belgium) in collaboration with the University of Milano-Bicocca (Italy). It used the same online platform and methods as the Dutch and English Crowdsourcing Projects (see Brysbaert et al., 2019a; Mandera et al., 2020). The study was approved by the ethical committee of the University of Milano-Bicocca (Approval Protocol Number 310/2017).
Materials
Words were initially selected from SUBTLEX-IT (Crepaldi et al., 2015). In contrast to previous crowdsourcing projects, the stimulus set also included inflected forms and clitics. Thus, in addition to the entries extracted from SUBTLEX, we also included (i) all inflected forms for nouns, (ii) one random inflected form for adjectives (e.g., feminine or plural), (iii) the whole paradigm for the 60 most frequent verbs, and (iv) two random inflected forms for all other (less frequent) verbs. Using a combination of stop lists and manual inspection, we removed proper names (in general, all entries marked with a capital letter), special characters, punctuation marks, and one-letter words that did not have a meaning of their own. The list thus included 126,000 items. This initial list was then enriched with lemmas from an online natural language processing (NLP) list containing 90,000 words, most of which were morphologically complex, and with entries from the Sabatini–Coletti dictionary (mostly compounds and rare complex forms; 5,000 lemmas). This addition was aimed at including morphologically complex words that might be known even if not frequently used (and hence not included in SUBTLEX-IT). Finally, we included a list of 57 web-, gaming-, and marketing-related neologisms (most of which blend English words and Italian morphology, e.g., “uploadare”) that we deemed interesting from a sociolinguistic perspective. The resulting list was then filtered with a combination of stop lists and manual inspection in order to remove highly offensive words (e.g., racial and sexual slurs) and repetitions (resulting from overlaps between the lists).
The final list thus included 130,495 words. Figure 1 shows the distributions of word length and word frequency in the word item list. Word length ranged from 1 to 26 letters (M = 9.71, SD = 2.58). Word frequency, retrieved from SUBTLEX-IT (Crepaldi et al., 2015) and expressed as Zipf scores (Brysbaert et al., 2018), ranged from 0 (not present in the SUBTLEX-IT corpus) to 7.26 (M = 1.36, SD = 1.28).
Fig. 1 Frequency distribution (Zipf-transformed) and length (in letters) of word items included in the ICP
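For reference, the Zipf scale expresses word frequency as the base-10 logarithm of the frequency per billion words. A minimal sketch of the conversion from raw corpus counts, using the Laplace smoothing recommended by Brysbaert et al. (2018) (note that in the ICP, words unattested in SUBTLEX-IT were simply assigned a Zipf score of 0):

```python
import math

def zipf_score(count: int, corpus_tokens: int, corpus_types: int) -> float:
    """Zipf score: log10 of the Laplace-smoothed frequency per million
    words, plus 3 (equivalently, log10 of the frequency per billion)."""
    per_million = (count + 1) / ((corpus_tokens + corpus_types) / 1e6)
    return math.log10(per_million) + 3
```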
The pseudowords were generated with Wuggy (Keuleers & Brysbaert, 2010) and were created to be as similar as possible to Italian words, taking into account phonotactic rules (e.g., consonant–vowel structure, transition probabilities within and between syllables), length, and morphological structure (some pseudowords contained existing affixes). Once created, the pseudoword list was inspected by the Italian-speaking authors to manually remove pseudowords that were in fact words (some turned out to be very low-frequency words or words belonging to Italian dialects). Moreover, in the first weeks of the experiment, the authors used feedback from participants to further prune the list, as a few more items turned out to be existing words in Italian dialects. After pruning, the final list contained 17,606 pseudowords.
Procedure
The test was accessible online at the address www.vocabolario.ugent.be. It could be completed with a personal computer or a mobile device (phone or tablet). The interface and instructions differed slightly for the two formats. While in the PC version participants had to press two specific buttons to answer (J for “yes” answers and F for “no” answers; see below), in the mobile version they were instructed to press virtual YES or NO buttons on the screen.
Before beginning the test, participants were informed about the scope of the study and data treatment and gave explicit consent to participate. They then filled out a short questionnaire to collect relevant information (e.g., age, gender, handedness, place where they grew up, level of education, first language, number of other languages they spoke and which language they spoke best, self-assessed proficiency in Italian). Participants could skip the questionnaire and go directly to the test. They could also save the questionnaire data so that they did not have to fill in the questionnaire anew if they decided to participate in the test more than once.
As in previous studies, participants were shown 100 letter strings (70 existing words and 30 pseudowords)3 and asked to press the J key if they knew the word on the screen, or the F key if they did not know the word (word knowledge task). They were also warned not to answer “yes” to strings they did not know, as that would result in a penalty. In addition, they were instructed to expect words to occur in both base form and inflected forms. Examples were given for all grammatical classes, including clitic forms (e.g., “andarci,” (to) go there; “credimi,” believe me). They were informed that the test would take about 4 minutes and that they could take the test again (in which case they would be given new stimuli).
For the sake of comparability, we adopted the same setup as previous crowdsourcing studies; hence, participants were not instructed to respond to each item as fast as possible, as is customary in lexical decision tasks. Note that previous literature has shown that reaction times collected under these conditions are still reliable if based on more than 80 observations per word (e.g., Mandera et al., 2020; Aguasvivas et al., 2020; see also the reliability analysis below, “Reliability, analyses, and discussion” section).
At the end of the test, participants were given an estimate of their vocabulary size, obtained by subtracting the percentage of “yes” answers to pseudowords (false alarms) from the percentage of “yes” answers to words (hits). For example, someone who answered “yes” to 51 words and to 2 pseudowords would receive a score of 51/70 − 2/30 ≈ .66, indicating that they knew about 66% of the Italian words. The estimate was only an approximation of a person’s actual vocabulary, but it mainly served to motivate participants to take the test and share their results with others. This gamified procedure constituted the test’s main dissemination channel. In addition, the test was advertised in national newspapers (La Repubblica), web magazines (Vice), blogs, and other online platforms (e.g., Reddit).
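The feedback score is a simple correction for guessing (hit rate minus false-alarm rate); a minimal sketch with the list sizes used in the test:

```python
def vocabulary_score(hits: int, false_alarms: int,
                     n_words: int = 70, n_pseudowords: int = 30) -> float:
    """Vocabulary estimate shown to participants: the proportion of
    "yes" answers to words minus the proportion of "yes" answers to
    pseudowords (a correction for guessing)."""
    return hits / n_words - false_alarms / n_pseudowords

# The example from the text: 51 hits and 2 false alarms
print(round(vocabulary_score(51, 2), 2))  # 0.66
```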
Results and discussion
Preprocessing
The raw data consisted of 26,260,500 data points, corresponding to 262,605 tests completed by 206,010 user profiles. For the analyses in this work, we followed a procedure similar to that of Mandera et al. (2020): a standard pipeline, fixed in advance and applied to the data without first inspecting any data patterns.
We looked only at the first three sessions of each user profile, to balance out the influence of individual profiles (participants could perform multiple sessions). This reduced the dataset to 24,833,600 observations. The first nine trials of each session were deleted, as they were considered to be warm-up trials. The number of observations at this point was 22,350,240. Trials with response times longer than 8000 ms were removed to ensure that participants were not consulting dictionaries or inquiring about the existence of a particular item. This further reduced the dataset to 22,003,390 data points. In the next step, outliers were filtered out on the basis of an adjusted boxplot method for positively skewed distributions (Hubert & Vandervieren, 2008; see also Mandera et al., 2020): outlier removal was applied separately for each individual session, and separately for words and pseudowords, leaving 20,878,641 observations. Next, sessions with negative scores were omitted (as these were likely due to participants pressing the wrong buttons), further reducing the dataset to 20,837,803 data points. Finally, only the data from users who had Italian as their native language (by selecting ‘Italiano’/“Italian” in the profile question about native language and simultaneously selecting ‘Sono madrelingua’/“native speaker” in the question about their level of Italian) were considered. This reduced the dataset to 15,906,229 observations from 189,143 tests and 156,625 user profiles. Of these, 11,177,296 were responses to words, leaving us with an average of 85.65 observations per word.
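For readers who want to reproduce the outlier step, the skewness-adjusted fences of Hubert and Vandervieren (2008) can be computed as sketched below. This is a simplified version: in the actual pipeline the filter was applied separately for each session and for words versus pseudowords, and the released scripts on OSF are authoritative.

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

def adjusted_boxplot_keep(rts: np.ndarray) -> np.ndarray:
    """Boolean mask of observations inside the skewness-adjusted boxplot
    fences (Hubert & Vandervieren, 2008), suited to positively skewed
    RT distributions."""
    q1, q3 = np.percentile(rts, [25, 75])
    iqr = q3 - q1
    mc = medcouple(rts)  # robust measure of skewness
    if mc >= 0:
        lo = q1 - 1.5 * np.exp(-4 * mc) * iqr
        hi = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lo = q1 - 1.5 * np.exp(-3 * mc) * iqr
        hi = q3 + 1.5 * np.exp(4 * mc) * iqr
    return (rts >= lo) & (rts <= hi)
```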
Participant demographics
Subsequent analyses are based on the user profiles of those who indicated Italian as their first language. The relevant demographic data are presented in Fig. 2.
Fig. 2 Demographic details of the participants indicating Italian as their L1
Among the participants, 5% were monolingual, 40% reported knowing one foreign language, 35% reported knowing two foreign languages, 13% reported knowing three foreign languages, approximately 3% reported knowing four foreign languages, and around 9% reported knowing more than four foreign languages. About 10% did not respond to this question.
When asked which foreign language they knew best, over 70% of participants indicated English, about 7% indicated French, approximately 5% indicated Spanish, and about 1% indicated German.
Finally, 98.66% of users had grown up in Italy, while 0.54% had grown up in other countries and 0.8% did not answer the question. Among the users who indicated Italy as the place where they grew up, 151,866 also indicated a specific region of origin.4 Interestingly, the distribution of users’ origins mirrored the population of the different Italian regions (based on data from ISTAT, the Italian National Institute of Statistics, updated to January 2024), as shown in Fig. 3 and supported by the high correlation between the number of users from each region and the population of that region (r = .96).
Fig. 3 Graphical representation of the origin of participants in our study (divided by regions) and the Italian population over regions
Reliability, analyses, and discussion
Full scripts and data for all the analyses are available at: https://osf.io/e4x7w/?view_only=20c718b7a8594fdba02256b475832597.
In the final dataset, 79.3% of the sessions were collected via a touchscreen device; in the remaining sessions, answers were given via a keyboard. Since data collected with the two devices showed strong correlations (RTs: r = .72; z-scored RTs: r = .82; accuracy: r = .92), we merged the data and treated them as a single dataset.5
We first calculated the split-half reliability of the word data separately for each response variable (accuracy and RTs). Reliability estimates were calculated with custom Python scripts. For 100 iterations, we randomly assigned half of the observations available for each word to one of two groups, and then averaged them. We then calculated the split-half reliability as the Pearson correlation between the response variable in the two groups (r), applying the Spearman–Brown correction (rSB) as is common practice in megastudies involving word recognition times (Aguasvivas et al., 2018; Ferrand et al., 2010; Tse et al., 2017). As a last step, we averaged the scores obtained across all random iterations. Random seeds were set for reproducibility purposes. The average split-half correlation score was high for both the RTs (rSB = 0.8948, SD = 0.0008; 95% CI = 0.8946–0.8950; uncorrected: r = 0.8096, SD = 0.0013; 95% CI = 0.8094–0.8099) and accuracy scores (rSB = 0.9792, SD = 0.0001; 95% CI = 0.9792–0.9793; uncorrected: r = 0.9593, SD = 0.0002; 95% CI = 0.9593–0.9594).
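The procedure can be reconstructed as follows (a simplified sketch; the actual Python scripts are released on OSF):

```python
import numpy as np

def split_half_reliability(obs_by_word: dict, n_iter: int = 100,
                           seed: int = 0) -> float:
    """Average Spearman-Brown-corrected split-half correlation across
    random splits. `obs_by_word` maps each word to an array of its
    per-trial values (RTs or 0/1 accuracies)."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    scores = []
    for _ in range(n_iter):
        half1, half2 = [], []
        for values in obs_by_word.values():
            shuffled = rng.permutation(values)
            mid = len(shuffled) // 2
            half1.append(shuffled[:mid].mean())
            half2.append(shuffled[mid:].mean())
        r = np.corrcoef(half1, half2)[0, 1]
        scores.append(2 * r / (1 + r))  # Spearman-Brown correction
    return float(np.mean(scores))
```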
In addition to evaluating the reliability of the full dataset, we explored how the split-half rSB score was affected by sample size. This analysis can inform future megastudies, indicating the sample size needed to achieve a desired reliability score. Our general procedure was similar to the calculation of split-half reliability on the full dataset, with the difference that we downsampled the number of observations per word, considering six different sample sizes, N ∈ {5, 10, 20, 40, 60, 80}. As in the previous analysis, for every value of N we randomly assigned half of the observations to one of two groups, averaged them, and calculated the split-half correlation between the responses in the two groups, applying the Spearman–Brown correction. Given the high computational demands of such repeated calculations, we reduced the number of random iterations to 10 (as opposed to the 100 used for calculating reliability on the whole dataset).6 The results of these analyses are reported in Fig. 4, which shows that chronometric data collected online can be highly reliable even with a relatively limited number of observations. With N as small as 20, reaction times achieve a reliability score of rSB = 0.7887 (SD = 0.0009; 95% CI = 0.7882–0.7893), and accuracy scores have a reliability above 0.9 (rSB = 0.9208, SD = 0.0003; 95% CI = 0.9206–0.9210).
Fig. 4 Split-half reliability for reaction times (left) and accuracy (right) as a function of the considered sample size. The error bars representing the SD are not visible due to the very low variation across random iterations
After assessing the reliability of the data, we computed accuracy and averaged response times for the 130,495 words included in the list. The median number of responses per word was 80. The distributions of accuracy and response times (RTs) are reported in Fig. 5. RTs were calculated on correct trials only; this resulted in the exclusion of 237 words that no participant recognized. The overall mean accuracy was .81 (SD = .24; Mdn = .93), and the mean RT was 1554 ms (SD = 465.7; Mdn = 1468).
Fig. 5 Distribution of accuracy and RTs to correct trials
As a further test of the reliability of the data, RTs and accuracy were analyzed in two separate regression models including orthographic length (in letters) and Zipf-transformed frequency (obtained from SUBTLEX-IT) as predictors. RTs were log-transformed. The models were then refitted after excluding data points with absolute residuals larger than 2.5 SD, to ensure that results were not influenced by a few overly influential outliers (model criticism; see Baayen, 2008). Results of the refitted models are reported in Table 1.
Table 1. Results of the regression models on reaction times and accuracy for the entire ICP. Predictor estimates are expressed as standardized beta coefficients

|                     | Reaction times |         |         | Accuracy |       |         |
|---------------------|----------------|---------|---------|----------|-------|---------|
| No. of items        | 130,258        |         |         | 130,495  |       |         |
| Explained variance  | 58%            |         |         | 42%      |       |         |
|                     | Estimate       | t       | p       | Estimate | t     | p       |
| Intercept           | 7.3432         | 3178.66 | < .0001 | 0.3733   | 171.6 | < .0001 |
| Orthographic length | 0.1557         | 81.19   | < .0001 | 0.3605   | 158.6 | < .0001 |
| Word frequency      | −0.6961        | −363.03 | < .0001 | 0.6712   | 295.4 | < .0001 |
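The two-step fitting procedure (fit, remove large residuals, refit) can be sketched in a few lines; the column names here are illustrative, and the released OSF scripts contain the actual analysis code.

```python
import numpy as np
import statsmodels.formula.api as smf

# df: one row per word, with (assumed) columns log_rt, length, and zipf
m0 = smf.ols("log_rt ~ length + zipf", data=df).fit()

# Model criticism (Baayen, 2008): drop observations whose standardized
# residual exceeds 2.5 in absolute value, then refit
keep = np.abs(m0.resid / m0.resid.std()) <= 2.5
m1 = smf.ols("log_rt ~ length + zipf", data=df[keep]).fit()
print(m1.summary())
```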
In the analysis of RTs, we observed the expected facilitatory effect of word frequency and the inhibitory effect of orthographic length, with an explained variance of 58%. This is in line with data from the Dutch Crowdsourcing Project (49%; Brysbaert et al., 2019a), the English Crowdsourcing Project (60%; Mandera et al., 2020), and the Spanish Crowdsourcing Project (49%; Aguasvivas et al., 2020). In the accuracy analysis (explained variance 42%) both effects were facilitatory: more accurate responses were observed for more frequent and longer words. While the effect of frequency is coherent with what is usually found in word recognition studies (the more frequent—and hence more familiar—a word is, the higher the probability of knowing it), the effect of length might seem surprising at first glance. In fact, orthographic length usually yields an inhibitory effect, with longer words obtaining lower accuracy. In this case, however, we observed the opposite pattern, and this might be due to the peculiarities of large-scale data, which include a large number of very rare words. Such words might offer more “lexicality cues” when orthographically long, since long words tend to be morphologically complex and hence include some affixes. A participant might be more likely to consider an unfamiliar word as existing when it includes such familiar sublexical elements, leading to a higher recognition rate (and hence accuracy) for longer words vis-à-vis short ones (e.g., Crepaldi et al., 2010; Bonandrini et al., 2023; Amenta et al., in press).
We also assessed the extent to which other lexical and semantic word features predict RTs and accuracy. For Italian, we do not have norm collections as extensive as those available for English; however, we were able to compile a smaller dataset of 1026 words with estimates for age of acquisition (AoA; Montefinese et al., 2019), concreteness and imageability (Con, Ima; Montefinese et al., 2014), measures of perceptual strength (maximum perceptual strength [MPS] and Minkowski 3 distance [Mink]; Vergallito et al., 2020), orthographic Levenshtein distance (OLD), phonological Levenshtein distance (PLD), and number of syllables (SumSylls; Goslin et al., 2014), along with word frequency (Crepaldi et al., 2015) and word length (in letters). Table 2 reports the correlations among these variables.
Table 2. Correlations between all predictor variables included in the following regression analysis (N = 1026)

|           | Length  | MPS     | Mink    | Ima     | Con     | SumSylls | OLD     | PLD     | AoA     |
|-----------|---------|---------|---------|---------|---------|----------|---------|---------|---------|
| Length    |         |         |         |         |         |          |         |         |         |
| MPS       | −.17*** |         |         |         |         |          |         |         |         |
| Mink      | −.17*** | .90***  |         |         |         |          |         |         |         |
| Ima       | −.23*** | .70***  | .69***  |         |         |          |         |         |         |
| Con       | −.22*** | .71***  | .70***  | .88***  |         |          |         |         |         |
| SumSylls  | .85***  | −.21*** | −.21*** | −.27*** | −.26*** |          |         |         |         |
| OLD       | .81***  | −.08**  | −.09**  | −.15*** | −.11*** | .71***   |         |         |         |
| PLD       | .79***  | −.10**  | −.11*** | −.14*** | −.11*** | .71***   | .96***  |         |         |
| AoA       | .34***  | −.45*** | −.46*** | −.58*** | −.47*** | .35***   | .33***  | .33***  |         |
| Frequency | −.41*** | .03     | .08*    | .19***  | .10**   | −.39***  | −.42*** | −.40*** | −.48*** |

Asterisks indicate levels of statistical significance: * p < .05; ** p < .01; *** p < .001
Given the high correlations among some of the predictors, we relied on the variance inflation factor (VIF) to select which predictors to retain in order to limit collinearity issues. We first ran a regression model on RT data with all the predictors described above. Then, using the car R package (Fox et al., 2012), we computed the VIF for each predictor. Figure 6 shows the distribution of VIF values.
Fig. 6 Variance inflation factor of all predictors considered
Following the common rule of thumb, we inspected all predictors with a VIF higher than 4. As expected, given that Italian has a transparent orthography, OLD and PLD showed extremely high VIF values and an almost perfect correlation; we therefore dropped PLD from the subsequent analysis. For the same reason, orthographic length and the number of syllables were very highly correlated; since length also showed a high VIF, we dropped it and retained the number of syllables. Concreteness and imageability also showed high VIFs and were highly correlated, as were MPS and Minkowski distance. Since previous studies indicated that imageability and Minkowski distance are better predictors of word processing in Italian than concreteness and MPS (Vergallito et al., 2020), we retained only the former two predictors in the subsequent analysis.
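The VIF screening in the paper relied on the car package in R; an equivalent computation in Python (with assumed column names) would look like this:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# predictors: dataframe with the candidate variables (assumed names)
cols = ["frequency", "length", "sumsylls", "old", "pld",
        "aoa", "ima", "con", "mps", "mink"]
X = add_constant(predictors[cols])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=cols,
)
print(vif[vif > 4].sort_values(ascending=False))  # rule-of-thumb threshold
```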
To assess the impact of the retained predictors, we ran a new regression model on mean RTs from the selected ICP data, with mean-centered predictors. Reaction times were log-transformed, and word frequency was expressed in Zipf scores. The model was then refitted after excluding data points with absolute residuals larger than 2.5 SD, to limit the influence of overly influential outliers (Baayen, 2008). VIF values were computed with the same method as above and are reported to show that the collinearity issues had been resolved. Results are shown in Table 3, and effects are represented in Fig. 7. The RT model accounted for 37.54% of the variance.
Table 3. Results of the regression on log RTs (N = 1026). Predictors are mean-centered. Regression weights are expressed as standardized beta coefficients

|                    | Estimate | t      | p       | VIF    |
|--------------------|----------|--------|---------|--------|
| Frequency          | −0.2733  | −8.998 | < .0001 | 1.4816 |
| No. syllables      | 0.2220   | 5.979  | < .0001 | 2.2147 |
| OLD                | 0.1677   | 4.534  | < .0001 | 2.1961 |
| Age of acquisition | 0.1272   | 3.596  | .0003   | 2.0081 |
| Imageability       | −0.0543  | −1.434 | .1519   | 2.3021 |
| Minkowski          | 0.0658   | 1.884  | .0599   | 1.9612 |
Fig. 7 Effect of the predictors on log RTs from the ICP (N = 1026). Effects depicted here refer to non-centered predictors. The first row shows the effects of frequency (Zipf) and number of syllables; the second row shows the effects of orthographic similarity to the closest neighbors (OLD) and mean age of acquisition; the third row shows the effects of imageability and perceptual strength (Minkowski 3)
The effects in the RT data resembled those found in previous lexical decision experiments in Italian and other languages. The effects of frequency and number of syllables went in the expected direction: shorter (fewer-syllable) and more frequent words were recognized faster than longer and less frequent words (e.g., Brysbaert et al., 2018). Words acquired earlier were also recognized faster than words acquired at a later stage, in line with the existing literature (e.g., Bonin et al., 2001; Brysbaert et al., 2000). Orthographic distance had a significant inhibitory effect on recognition times: words from denser neighborhoods (i.e., words with many similar neighbors, and hence lower OLD) were recognized faster than words from sparser neighborhoods. This is in line with the previous literature (e.g., Pollatsek et al., 1999) and with the British Lexicon Project data (Keuleers et al., 2012), but in contrast to the English Crowdsourcing Project data (see Mandera et al., 2020). In contrast to previous lexical decision studies in Italian (e.g., Vergallito et al., 2020), imageability and perceptual strength, here operationalized as Minkowski 3 distance (Lynott et al., 2020), did not have a significant effect on word recognition latencies. This also constitutes a departure from other crowdsourcing projects (e.g., Mandera et al., 2020), where semantic predictors showed strong and significant effects. As semantic effects tend to be small, it is possible that they could not be detected given the relatively small item sample in the current analysis.
We repeated the same analysis for the averaged accuracy data. In this case, however, the model explained only about 6% of the variance. We believe this is because the subset we considered comprised well-known words: accuracy in this dataset was almost at ceiling (M = .99, SD = .01). For this reason, we do not report and discuss the model in this section (it is, however, included in the Analyses folder on OSF).
Validation study
As psycholinguistic studies investigating word processing have traditionally been conducted in lab settings, we set out to test the validity of the data collected online by comparing them to lab-collected data. When assessing the reliability of a novel resource, one ideally wants to pit it against the strongest baseline. For this reason, we compared our data to data collected in a controlled lab setting with young, highly educated participants (i.e., university students). If our data, collected in an arguably less controlled environment (online) and with a less uniform group of participants (of different ages, education levels, etc.), replicate what is found in traditional controlled experiments, this speaks both to the reliability of the data and, incidentally, to the robustness of the reported effects.
Since the Italian language has no lexicon project, we conducted a lab-based lexical decision experiment (arguably the most common task in word recognition studies), keeping the same list composition as the online study (70% word items and 30% pseudoword items) but adopting the traditional procedural constraints of a lexical decision task (see Section "Methods").
This experiment offered the further opportunity to evaluate the impact of word prevalence, which, thanks to the ICP, can be computed for the first time for Italian (see Keuleers et al., 2015; Brysbaert et al., 2019b; Mandera et al., 2020). We compiled a list with the same proportion of words and pseudowords as in the ICP, using items randomly extracted from the ICP lists to create a manageable set. We also kept the procedure as close as possible to the online procedure, including using the same response keys. However, consistent with previous literature and existing lexicon projects (e.g., Mandera et al., 2020), we used a lexical decision task instead of a vocabulary knowledge task. Details of the procedure are reported below.
Methods
Participants
Data were collected at the University of Milano-Bicocca. The final sample, consisting of 43 participants, was balanced in terms of gender (55.81% women) and uniform in terms of age (M = 21.86, SD = 1.67), and mostly comprised students from the University of Milano-Bicocca (69.09%) who received credits for participating in the experiment. All participants lived in Milan and spoke Italian as their native language. They provided written informed consent, in accordance with the Committee for Minimal-Risk Studies of the Department of Psychology, University of Milano-Bicocca (Approval Protocol Number RM-2018-123).
Materials
Lexical decision data were collected on two lists of 500 items each, consisting of 70% words and 30% pseudowords. These stimuli were randomly selected from the list used in the online data collection. Inflected forms sharing the same stem were removed from the word lists to avoid stem repetition, and inappropriate or potentially offensive words were discarded. The removed items were replaced with words of the same length and frequency. The resulting words and pseudowords were randomly assigned to the two lists, and list order was counterbalanced to compensate for the inevitable drop in attention in the second phase of the experiment.
Procedure
Participants were invited to turn off their cellphones and to wear headphones for the entire duration of the experiment. Additionally, the internet connection was disabled at the testing stations. Instructions and the experiment were implemented in E-Prime 3.0 and visually mimicked the online experiment as closely as possible. The stimuli were presented in randomized order in white font on a black screen, each preceded by a white fixation cross remaining on the screen for 500 ms. Each stimulus remained on the screen until a response was given, and an interval of 1000 ms separated consecutive trials. Participants were instructed to press two different keys (J and F) with the index fingers of each hand to identify the string as a word or a pseudoword, respectively. A graphical representation of the correct position of the hands on the keyboard was provided in order to ensure correct performance in the test. The experimental instructions emphasized accuracy and speed equally. A familiarization session composed of 10 items (7 words, 3 pseudowords) followed the presentation of the instructions. Before the beginning of the task, participants were reminded to identify the stimuli as fast as possible; this instruction was repeated before the presentation of the second list. Participants were invited to take a break (5–10 minutes) after the presentation of the first list; because of its potential effects on reaction times, caffeine consumption was not allowed. When the experiment was concluded, participants were debriefed with respect to the aim of the study.
Results and discussion
Comparison with lab-obtained data
Three items (mitena, a type of glove; psilomelano, a type of mineral; and militassolta, someone who has completed military service) had to be excluded because no participants recognized them as existing Italian words in the lab experiment. Analyses were run on the remaining 697 words. Datapoints were further filtered for accuracy (excluding data associated with wrong responses) and RTs (excluding RTs shorter than 400 ms). By-item average RTs were computed on the resulting dataset.
We observed a high correlation (r = .78; r = .82 for log-transformed reaction times) between average RTs obtained in the online study and average RTs collected in the lab experiment (see Fig. 8), showing substantial consistency between the two samples.
Fig. 8 Correlation between ICP data (x-axis) and lexical decision data (y-axis)
To evaluate the extent to which the two datasets were also comparable in terms of data structure, we further ran parallel regression analyses on log RTs, again considering length and frequency effects to probe the quality of the data. Results are reported in Table 4.
Table 4. Results of the regression analyses conducted on the ICP data (left) and lab lexical decision data (right). Predictor estimates are expressed as standardized beta coefficients

|                | ICP      |        |         | Lab lexical decision |        |         |
|----------------|----------|--------|---------|----------------------|--------|---------|
|                | Estimate | t      | p       | Estimate             | t      | p       |
| Intercept      | 7.2754   | 217.73 | < .0001 | 6.7276               | 214.94 | < .0001 |
| Length         | 0.1639   | 5.99   | < .0001 | 0.3524               | 12.63  | < .0001 |
| Word frequency | −0.6642  | −24.26 | < .0001 | −0.5225              | −18.73 | < .0001 |
In both samples we observe significant effects of length and frequency and similar degrees of explained variance (ICP: 55%; lab lexical decision: 53%). The relative weights of the effects are however slightly different: whereas in the online study RTs are substantially more affected by word frequency, in the lab experiment the impact of length and frequency is more balanced. This might depend on the slightly different instructions provided to participants: in the former they were presented with a vocabulary knowledge test (hence with a strong focus on the lexical properties of the stimulus), in the latter with a more traditional lexical decision task, stressing the importance of both accuracy and speed (hence encouraging strategic behaviors, potentially strengthening the reliance on surface cues such as orthographic length).
Note also that the pattern observed in the subsample of online data considered here (left-hand portion of Table 4) is highly consistent with the results reported for the entire dataset (Table 1). This speaks to the internal consistency of the data collected, on the one hand, and to the representativeness of what was observed in the current validation study for the entire resource on the other.
In conclusion, we tested the validity of the data collected online by comparing them against lab-based lexical decision data. Notwithstanding differences in the task (non-speeded vocabulary knowledge vs. speeded lexical decision), data collection method (online vs. lab-based), and participants involved (a random sample of the population of Italian L1 speakers vs. university students), we demonstrated that word recognition data collected online are consistent with data collected in traditional experiments, as shown by the high correlation between the two sets. Moreover, the convergent patterns of basic effects (word frequency and word length), in terms of both direction and explained variance, speak to the robustness of these effects across tasks and conditions.
The effect of word prevalence
Starting from average accuracy, word prevalence was computed as the probit of the proportion of participants who knew the word, i.e., the z-value bounding the corresponding probability mass under a standard normal distribution. Note that, following Keuleers et al. (2015), corrected probabilities were obtained as 0.005 + accuracy × 0.99, which keeps the probabilities strictly between 0 and 1 and makes differences at the higher end of the range more prominent. Figure 9 shows the distribution of prevalence compared to the distribution of word frequency for the 697 words considered here. The two variables are moderately correlated (r = .61).
Fig. 9 Distribution of word prevalence and log-transformed word frequency over 697 words included in the validation study
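Concretely, the transformation described above can be written in a few lines (a sketch of the Keuleers et al., 2015, procedure):

```python
from scipy.stats import norm

def word_prevalence(accuracy: float) -> float:
    """Probit of the corrected proportion of participants knowing the
    word (Keuleers et al., 2015). The correction keeps the proportion
    strictly inside (0, 1), so the probit stays finite."""
    return norm.ppf(0.005 + accuracy * 0.99)

print(round(word_prevalence(0.93), 2))  # median ICP accuracy -> ~1.44
print(round(word_prevalence(1.00), 2))  # word known by everyone -> ~2.58
```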
The effect of word prevalence was tested against lab-collected RTs in a regression analysis, also including word frequency and orthographic length as predictors. Results of the analysis are reported in Table 5.
Table 5. Results of the regression model with prevalence. Predictor estimates are expressed as standardized beta coefficients

|            | Estimate | t       | p       |
|------------|----------|---------|---------|
| Intercept  | 6.6884   | 242.340 | < .0001 |
| Length     | 0.4955   | 18.338  | < .0001 |
| Frequency  | −0.1898  | −6.477  | < .0001 |
| Prevalence | −0.4627  | −14.278 | < .0001 |
We observed an effect of word prevalence over and above frequency and length. The explained variance of this model was 65%, indicating that the inclusion of prevalence substantially improved model fit (by 12 percentage points, since the model with only frequency and length had an adjusted R-squared of .53). This pattern is in line with the results observed in previous crowdsourcing projects in Dutch (Brysbaert et al., 2016) and English (Brysbaert et al., 2019b).
Case study 1: Inflectional entropy effects for Italian verbs
The effect of inflectional entropy has been studied a number of times in psycholinguistics (e.g., Moscoso del Prado Martín et al., 2004; Milin et al., 2009a, b). This metric reflects the informativeness of an inflectional paradigm, based on the frequency distribution of its forms, and usually has a facilitatory effect in word recognition: the greater the entropy, the faster the responses (Milin et al., 2009a). Most research in this area has focused on nominal paradigms. These can be quite rich, especially in the Slavic languages, which have been extensively researched in this domain (in contrast to Italian, where nouns typically vary only in number and hence have just two forms). Extending this evidence to verb paradigms would therefore speak to the generalizability of the effect. Romance languages provide an important case study in this regard: in Italian, for example, a verb paradigm includes more than 100 inflected forms, with limited redundancy in the affixes used. We therefore used the present resource to test the inflectional entropy effect in verb paradigms.
Methods
RTs for 24,357 different verb forms were considered, belonging to 3111 paradigms with a lemma frequency higher than 50 (according to SUBTLEX-IT). This frequency threshold was applied for two reasons: on the one hand, it ensures a certain variability in the usage of different verb forms, allowing for more reliable entropy estimates; on the other hand, it excludes verbs that are very rare, and thus largely unfamiliar to many speakers, for which an entropy effect might not be observable.
For each paradigm, inflectional entropy was computed relying on the frequency distributions and lemmatization data of SUBTLEX-IT. Due to the redundancy of certain inflectional endings, and in line with the standards established by previous studies (Milin et al., 2009a, b), entropy was based on frequency values at the form (rather than the case) level. To account for corpus-unattested forms, each paradigm was padded with the zero-frequency inflections needed to reach a target paradigm size of 176, the largest paradigm observed in the SUBTLEX-IT corpus (176 forms for the verb fare, “to do”). Estimates were obtained with the entropy.shrink function from the entropy R package (Hausser et al., 2012).
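For illustration, the shrinkage entropy estimator behind entropy.shrink (James-Stein shrinkage toward the uniform distribution; Hausser & Strimmer, 2009) can be sketched in Python as follows, with zero-padding to the target paradigm size. This is an approximate reconstruction; the released analysis code on OSF is authoritative.

```python
import numpy as np

def shrinkage_entropy(counts, paradigm_size: int = 176) -> float:
    """James-Stein shrinkage entropy: shrink the ML frequency estimates
    toward the uniform distribution, then compute the plug-in entropy
    (natural log, matching the R package default)."""
    counts = np.asarray(counts, dtype=float)
    # pad unattested forms with zeros (assumes len(counts) <= paradigm_size)
    counts = np.pad(counts, (0, paradigm_size - len(counts)))
    n = counts.sum()
    p_ml = counts / n                        # maximum-likelihood frequencies
    target = np.full(paradigm_size, 1 / paradigm_size)  # uniform target
    num = 1 - np.sum(p_ml ** 2)
    den = (n - 1) * np.sum((target - p_ml) ** 2)
    lam = 1.0 if den == 0 else float(np.clip(num / den, 0, 1))
    p = lam * target + (1 - lam) * p_ml      # shrunk frequency estimates
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```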
The effect of inflectional entropy was then tested on ICP log-transformed RTs via a mixed-effects regression model. Covariates included log-transformed form frequency, log-transformed lemma frequency, orthographic length, and paradigm size (number of forms actually attested in the corpus for a given verb). The verb lemma was included as a random intercept.
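A sketch of this model using Python's statsmodels, with assumed column names (the released scripts on OSF contain the actual analysis code):

```python
import statsmodels.formula.api as smf

# verbs: one row per inflected form, with (assumed) columns log_rt,
# log_form_freq, log_lemma_freq, length, paradigm_size, entropy, lemma
fit = smf.mixedlm(
    "log_rt ~ log_form_freq + log_lemma_freq + length"
    " + paradigm_size + entropy",
    data=verbs,
    groups="lemma",  # random intercept for each verb lemma
).fit()
print(fit.summary())
```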
Results and discussion
Results are reported in Table 6. We found a significant facilitatory effect of inflectional entropy for Italian verbs, in line with previous results for nominal paradigms in different languages (e.g., Milin et al., 2009a, b). Furthermore, we observed the expected facilitatory effect of frequency and inhibitory effect of length, as well as an unexpected inhibitory effect of paradigm size. The latter is likely due to a suppression effect (Mosteller & Tukey, 1977): paradigm size is positively correlated with both frequency (r = .39) and entropy (r = .36), possibly leading to an unreliable estimation of its parameter. One must therefore be cautious in interpreting this counterintuitive result. The model had a marginal R-squared of .48.
Table 6. Inflectional entropy analysis. Results of the mixed-effects regression model on inflected verb ICP data (N = 24,357)

|               | Estimate   | t       | p       |
|---------------|------------|---------|---------|
| Frequency     | −6.466e−02 | −124.64 | < .0001 |
| Length        | 2.111e−02  | 33.40   | < .0001 |
| Paradigm size | 1.244e−03  | 15.49   | < .0001 |
| Entropy       | −2.235e−02 | −10.67  | < .0001 |
Previous studies of inflectional entropy in nominal paradigms have found a facilitatory effect of entropy on word processing, whereby the higher the paradigm's entropy, the faster the noun is recognized (Moscoso del Prado Martín et al., 2004; Milin et al., 2009a, b). With the present analysis we extend these results to verb processing, supporting the psycholinguistic validity of this effect and the robustness of the phenomenon across languages and grammatical classes. The ICP thus provides scholars interested in paradigmatic effects with an extensive set of behavioral data on which to test their theoretical considerations and model predictions (e.g., Marzi & Pirrelli, 2023).
Case study 2: The “clitic” effect
Clitics (and, more generally, object particles) are morphemic units that attach to a given word but typically refer to a different element in the sentence or discourse. For example, elements like -ci, -ne, -le, -gli, and -lo can be combined with Italian verbs to convey their direct object, indirect object, a reflexive pronoun, etc. Consider parlale, combining parla (s/he speaks) with -le (to her): both a verb and its object are embedded in a single complex word, crucially referring to another element in the sentence (i.e., whatever -le refers to). Clitics are widely studied for their syntactic role in sentence processing (e.g., Arosio et al., 2014; Cresti, 2009; Moscati et al., 2023; Wanner, 1987); however, one may wonder about their impact on isolated word processing. Given their crucial link with other elements in the sentence, one would predict that presenting a word embedding a clitic in isolation makes processing more difficult. We exploited the present resource to test this hypothesis.
Methods
We identified a subset of the resource that allowed us to evaluate the RTs for words with clitics vis-à-vis a set of appropriate control words. This subset was obtained by filtering the entire database for verbs (using the part-of-speech annotation of SUBTLEX-IT) and then further selecting the ones (potentially) ending with such elements (-mi, -ci, -ti, -vi, -le, -lo, -la, -li, -gli, -me, -ne, -si, -se, -te).
The resulting subset of 9740 words was manually annotated to distinguish words actually embedding a clitic (e.g., spegniti, “turn yourself off”) from words with a clitic-like ending that does not embed a clitic (e.g., vorresti, “you would”) and from ambiguous cases (e.g., abbinati, “paired” or “pair yourself,” depending on which vowel is accented). Once the ambiguous entries were excluded, we were left with 9302 words, of which 6240 actually embedded a clitic.
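A first-pass orthographic filter of this kind is straightforward; the sketch below (with an illustrative helper) flags clitic-like endings, after which manual annotation is still required to separate true clitics from look-alikes:

```python
import re

# Clitic-like endings screened for in the verb subset
ENDINGS = ("mi", "ci", "ti", "vi", "le", "lo", "la", "li",
           "gli", "me", "ne", "si", "se", "te")
PATTERN = re.compile("(?:" + "|".join(ENDINGS) + ")$")

def clitic_like(form: str) -> bool:
    """Orthographic screen only: 'spegniti' (true clitic) and 'vorresti'
    (look-alike ending) both pass; disambiguation remains manual."""
    return bool(PATTERN.search(form))

print([clitic_like(w) for w in ("spegniti", "vorresti", "parlare")])
# [True, True, False]
```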
Results and discussion
Average RTs for verbs embedding an actual clitic were longer (1532 ms, SEM = 4.29) than RTs for verbs with merely clitic-like endings (1492 ms, SEM = 5.78). To test this effect, we ran a mixed-effects model including random intercepts for the verb lemma (since some lemmas had many more of their forms in the list than others). The “clitic” effect was significant (t = 8.78; p = .0001).
As a further control, we added the orthographic length and word frequency (obtained from SUBTLEX-IT and log-transformed) of the stimulus as covariates in the very same model. The model had a marginal R-squared of .42. Results are reported in Table 7.
Table 7. Analysis of the “clitic” effect. Results of the mixed-effects regression model on ICP data

|           | Estimate   | t       | p       |
|-----------|------------|---------|---------|
| Clitic    | −2.822e−02 | −7.408  | < .0001 |
| Frequency | −5.820e−02 | −56.808 | < .0001 |
| Length    | 2.616e−02  | 24.848  | < .0001 |
The effect of the presence of a clitic was again significant. However, the associated parameter was estimated to be negative, contrary to the descriptive observations. This is likely due to a suppression effect: the conditions of interest are not matched for frequency (2.45 vs. 2.26, p = .0001) or length (9.12 vs. 9.97; p = .0001). This pattern makes us cautious about interpreting the effect as genuinely dependent on the presence of a clitic. Since verbs embedding a clitic tend to be longer and less frequent, their increased difficulty might simply be due to these general properties.
General discussion
This work was aimed at creating the first large collection of accuracy and response latency data for word recognition in Italian. This is the largest word recognition experiment conducted for the Italian language, in terms of both items (over 130,000) and participants from all over Italy (over 150,000). Of particular interest is that the stimulus set includes many inflected forms, in contrast to other crowdsourcing projects, which only presented lemmas (uninflected word forms).
Although participants were not instructed to respond as fast as possible (as in traditional lexical decision experiments), our findings replicated previous observations that the time needed to respond provides useful information once the first nine trials of each session and extreme outliers are excluded. The most likely reason for this is that each word was responded to by some 100 participants, which left room for data cleaning (we still had 80 responses per word after cleaning).
Our RT data correlated highly (r = .78) with data obtained in a separate lab-based study involving a representative subsample of 697 words. Furthermore, we were able to replicate the most important effects found in previous megastudies in other languages and in small-scale studies in Italian.
This work describes a valuable resource for language researchers interested in peculiarities of the Italian language and for colleagues interested in cross-linguistic research.
Italian is interesting because it presents a variety of inflectional forms and morphological variations (especially with respect to verbs) in a transparent orthography. The present work is particularly interesting because, for the first time, the stimulus set includes a large number of inflected forms, which may allow us to calculate processing times for such forms (which are ubiquitous in Italian sentences and texts) and relate them, for example, to eye movement data (Siegelman et al., 2022). We have included two case studies as examples of the possibilities offered by the present resource. We have shown that the RT database can be used both to extend previous evidence to new languages and domains (inflectional entropy effects in Italian verbs), and to explore new empirical phenomena to be further tested in future experiments (a possible “clitic effect” in the processing of isolated words).
Of course, the two case studies included in the present study are, in fact, just two examples of the potential applications of this resource. Besides existing morphologically complex forms, we also included a selection of novel words and colloquialisms rarely found in written format. We believe that these items could be useful to those interested in the study of slang or neologisms.
Moreover, the ICP includes, for the first time for Italian, a measure of word prevalence (see Keuleers et al., 2015; Brysbaert et al., 2019b). Word prevalence is not only extremely useful when studying speakers’ knowledge of low-frequency words, but also invaluable for creating item sets for experimental studies. It can hence serve both as a dependent variable, to investigate which properties make a word more likely to be known by the speakers of a language, and as an independent variable, to test how this aspect affects word processing. Furthermore, paired with the additional information we collected and made available on OSF, prevalence allows researchers to investigate the geographical distribution of vocabulary knowledge, potentially contributing to dialectology and the study of regionalisms.
Last but not least, we believe that the ICP will also contribute to the effort of cross-linguistic and cross-task validation of word processing effects. The growing number of large datasets based on the same methodology but encompassing different languages represents a major opportunity for researchers in language processing to move away from the still pervasive Anglocentrism of the field (Levisen, 2019). This direction has long been advocated, but only recently has a concrete effort been made to create linguistic resources that allow for systematic cross-linguistic comparison on a large number of topics (e.g., Siegelman et al., 2022, 2024; Kuperman, 2022; Kuperman et al., 2023; Sulpizio et al., 2024). Regarding cross-task validation, large language resources like the ICP and other datasets from the same family lend themselves to comparison with language processing resources employing different tasks (e.g., lexical decision, priming), making it possible to systematically study the robustness of psycholinguistic effects across tasks and across different sample sizes (for some attempts in this direction, see Mandera et al., 2020).
Availability
The raw data along with the Italian Crowdsourcing Project data are available at https://osf.io/e4x7w/?view_only=20c718b7a8594fdba02256b475832597.
The raw data include all unprocessed responses to all words and pseudowords. The ICP data are instead presented in a more manageable format in a dedicated .csv file created following the criteria described in the present paper. This file is organized as follows: the first column (spelling) contains the word; the second column (nobs) contains the number of observations on which the RTs are based; the third column (accuracy) contains the average accuracy rate; the fourth and fifth columns contain the average response time (rt_correct_mean) and the corresponding standard deviation (rt_correct_sd), respectively; and the sixth and seventh columns contain the standardized average response time (rt_zscore_correct_mean) and the corresponding standard deviation (rt_zscore_correct_sd). Finally, we release an ICP prevalence dataset including the word (spelling), the number of observations (nobs), the average accuracy (accuracy), and prevalence, computed following Keuleers et al. (2015; we refer to this paper for more details about prevalence and how to use this variable in psycholinguistic studies).
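As a usage pointer, the R sketch below reads the item-level file and inspects the columns just described; the file name "ICP_data.csv" is a placeholder assumption and should be replaced with the actual name of the .csv file on OSF.

    # Read the ICP item-level file; "ICP_data.csv" is a placeholder name.
    icp <- read.csv("ICP_data.csv", stringsAsFactors = FALSE)
    head(icp[, c("spelling", "nobs", "accuracy",
                 "rt_correct_mean", "rt_correct_sd",
                 "rt_zscore_correct_mean", "rt_zscore_correct_sd")])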
Authors’ contributions
Ideation: SA, PM, EK, MB, MM. Online data collection: SA, PM, MM. Lab data collection: AGDV. Online data preprocessing: PM. Data analysis: SA, AGDV, MM. First draft: SA, AGDV, MM, PM, EK, MB. First revision—analyses: SA, AGDV, PM. First revision—draft: SA, AGDV, MM, PM, EK, MB.
Funding
Simona Amenta, Marc Brysbaert, and Marco Marelli gratefully acknowledge the financial support of the Research Foundation – Flanders (FWO) through grant 3G011617W.
Data availability
All data, analysis code, and materials are available at: https://osf.io/e4x7w/?view_only=20c718b7a8594fdba02256b475832597
Declarations
Ethics approval
Approval was obtained from the ethics committee of the University of Milano-Bicocca (see corresponding sections in the manuscript). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent for publication
No data allowing the identification of participants were collected or included in this paper or in any of the available materials. Participants agreed to the publication of their behavioral data in anonymous and/or aggregated form.
Competing interests
Marco Marelli and Simona Amenta serve as Associate Editor and Consulting Editor, respectively, on the Editorial Board of this Journal. The authors have no competing interests to declare that are relevant to the content of this article.
1 Participants could take the test multiple times, because they received different words each time. To limit the contribution of each participant, we decided to consider only the first three sessions per participant in the analyses (see also below).
2 Clitics are morphemic units attached to a word but referring to another element in the discourse (e.g., “andarci,” go there; “credimi,” believe me).
3 This proportion of word and pseudoword items is standard in crowdsourcing studies and is motivated by the necessity of keeping the influence of low-frequency items under control. When sampling from the whole lexicon of a language, most word items will have a very low to extremely low frequency (see also Fig. 1), which means that rare word items have a high probability of ending up in the lists. These rare words will be unknown to most participants and will thus look like pseudowords to them. Including a smaller number of pseudowords compensates for the feeling of having too many unfamiliar items in the list.
4 Note that while this information was not used in the present analyses, it is made available in the raw dataset.
5 Information about the device used to produce a response is listed in the raw dataset.
6 Note that the reduced variation across iterations indicates that a large number of random initializations is not needed to obtain reliable estimates.
References
Aguasvivas, J. A., Carreiras, M., Brysbaert, M., Mandera, P., Keuleers, E., & Duñabeitia, J. A. (2018). SPALEX: A Spanish lexical decision database from a massive online data collection. Frontiers in Psychology, 9, 2156. https://doi.org/10.3389/fpsyg.2018.02156
Aguasvivas, J., Carreiras, M., Brysbaert, M., Mandera, P., Keuleers, E., & Duñabeitia, J. A. (2020). How do Spanish speakers read words? Insights from a crowdsourced lexical decision megastudy. Behavior Research Methods, 52, 1867–1882. https://doi.org/10.3758/s13428-020-01357-9
Amenta, S., Foppolo, F., & Badan, L. (in press). The role of morphological information in processing pseudo-words in Italian L2 learners: It’s a matter of experience. Journal of Cognition.
Amenta, S., Marelli, M., & Sulpizio, S. (2017). From sound to meaning: Phonology-to-semantics mapping in visual word recognition. Psychonomic Bulletin & Review, 24, 887–893. https://doi.org/10.3758/s13423-016-1152-0
Arosio, F., Branchini, C., Barbieri, L., & Guasti, M. T. (2014). Failure to produce direct object clitic pronouns as a clinical marker of SLI in school-aged Italian speaking children. Clinical Linguistics & Phonetics, 28.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445–459.
Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies: What do millions (or so) of trials tell us about lexical processing? In Visual word recognition, Volume 1 (pp. 90–115). Psychology Press.
Bonandrini, R., Amenta, S., Sulpizio, S., Tettamanti, M., Mazzucchelli, A., & Marelli, M. (2023). Form to meaning mapping and the impact of explicit morpheme combination in novel word processing. Cognitive Psychology, 145, 101594. https://doi.org/10.1016/j.cogpsych.2023.101594
Bonin, P., Chalard, M., Méot, A., & Fayol, M. (2001). Age-of-acquisition and word frequency in the lexical decision task: Further evidence from the French language. Current Psychology of Cognition, 20.
Brysbaert, M., Lange, M., & Van Wijnendaele, I. (2000). The effects of age-of-acquisition and frequency-of-occurrence in visual word recognition: Further evidence from the Dutch language. European Journal of Cognitive Psychology, 12.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance, 42.
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27.
Brysbaert, M., Keuleers, E., & Mandera, P. (2019a). Recognition times for 54 thousand Dutch words: Data from the Dutch Crowdsourcing Project. Psychologica Belgica, 59.
Brysbaert, M., Mandera, P., McCormick, S. F., & Keuleers, E. (2019b). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51, 467–479. https://doi.org/10.3758/s13428-018-1077-9
Crepaldi, D., Rastle, K., & Davis, C. J. (2010). Morphemes in their place: Evidence for position-specific identification of suffixes. Memory & Cognition, 38, 312–321. https://doi.org/10.3758/MC.38.3.312
Crepaldi, D., Amenta, S., Mandera, P., Keuleers, E., & Brysbaert, M. (2015). SUBTLEX-IT. Subtitle-based word frequency estimates for Italian. In Proceedings of the Annual Meeting of the Italian Association for Experimental Psychology (pp. 10–12).
Cresti, E. (2009). Clitics and anaphoric relations in informational patterning: A corpus-driven research in spontaneous spoken Italian (C-ORAL-ROM). Information Structure and Its Interfaces, 19, 169. https://doi.org/10.1515/9783110213973.2.169
del Prado Martín, F. M., Kostić, A., & Baayen, R. H. (2004). Putting the bits together: An information theoretical perspective on morphological processing. Cognition, 94.
Ethnologue. (2023). What are the top 200 most spoken languages? Retrieved January 24, 2024.
Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., … Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42, 488–496.
Fox, J., Weisberg, S., Adler, D., Bates, D., Baud-Bovy, G., Ellison, S., … Heiberger, R. (2012). Package ‘car’. R Foundation for Statistical Computing, 16(332), 333.
Goslin, J., Galluzzi, C., & Romani, C. (2014). PhonItalia: A phonological lexicon for Italian. Behavior Research Methods, 46, 872–886. https://doi.org/10.3758/s13428-013-0400-8
Hartshorne, J. K., de Leeuw, J. R., Goodman, N. D., Jennings, M., & O’Donnell, T. J. (2019). A thousand studies for the price of one: Accelerating psychological science with Pushkin. Behavior Research Methods, 51, 1782–1803. https://doi.org/10.3758/s13428-018-1155-z
Hausser, J., Strimmer, K., & Strimmer, M. K. (2012). Package ‘entropy’. R Foundation for Statistical Computing.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33.
Hubert, M., & Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52, 5186–5201.
Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments. The Quarterly Journal of Experimental Psychology, 68.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42, 627–633. https://doi.org/10.3758/BRM.42.3.627
Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1, 174. https://doi.org/10.3389/fpsyg.2010.00174
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44, 287–304. https://doi.org/10.3758/s13428-011-0118-4
Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology, 68.
Kuperman, V. (2022). A cross-linguistic study of spatial parameters of eye-movement control during reading. Journal of Experimental Psychology: Human Perception and Performance, 48.
Kuperman, V., Siegelman, N., Schroeder, S., Acartürk, C., Alexeeva, S., Amenta, S., … Usal, K. A. (2023). Text reading in English as a second language: Evidence from the Multilingual Eye-Movements Corpus. Studies in Second Language Acquisition, 45(1), 3–37.
Kyröläinen, A.-J., Keuleers, E., Mandera, P., Brysbaert, M., & Kuperman, V. (2021). Affect across adulthood: Evidence from English, Dutch, and Spanish. Journal of Experimental Psychology: General, 150.
Levisen, C. (2019). Biases we live by: Anglocentrism in linguistics and cognitive sciences. Language Sciences, 76, 101173. https://doi.org/10.1016/j.langsci.2018.05.010
Lynott, D., Connell, L., Brysbaert, M., Brand, J., & Carney, J. (2020). The Lancaster Sensorimotor Norms: Multidimensional measures of perceptual and action strength for 40,000 English words. Behavior Research Methods, 52, 1271–1291. https://doi.org/10.3758/s13428-019-01316-z
Mandera, P., Keuleers, E., & Brysbaert, M. (2020). Recognition times for 62 thousand English words: Data from the English Crowdsourcing Project. Behavior Research Methods, 52, 741–760. https://doi.org/10.3758/s13428-019-01272-8
Marzi, C., & Pirrelli, V. (2023). A discriminative information-theoretical analysis of the regularity gradient in inflectional morphology. Morphology, 33.
Maziyah Mohamed, M., Yap, M. J., Chee, Q. W., & Jared, D. (2023). Malay Lexicon Project 2: Morphology in Malay word recognition. Memory & Cognition, 51.
Milin, P., Đurđević, D. F., & del Prado Martín, F. M. (2009). The simultaneous effects of inflectional paradigms and classes on lexical recognition: Evidence from Serbian. Journal of Memory and Language, 60.
Milin, P., Kuperman, V., Kostić, A., & Baayen, R. H. (2009). Paradigms bit by bit: An information theoretic approach to the processing of paradigmatic structure in inflection and derivation. In Analogy in grammar: Form and acquisition (pp. 214–252). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199547548.003.0010
Montefinese, M., Ambrosini, E., Fairfield, B., & Mammarella, N. (2014). The adaptation of the Affective Norms for English Words (ANEW) for Italian. Behavior Research Methods, 46, 887–903. https://doi.org/10.3758/s13428-013-0405-3
Montefinese, M., Vinson, D., Vigliocco, G., & Ambrosini, E. (2019). Italian age of acquisition norms for a large set of words (ItAoA). Frontiers in Psychology, 10, 278. https://doi.org/10.3389/fpsyg.2019.00278
Moscati, V., Marini, A., & Biondo, N. (2023). What a thousand children tell us about grammatical complexity and working memory: A cross-sectional analysis on the comprehension of clitics and passives in Italian. Applied Psycholinguistics, 44.
Mosteller, F., & Tukey, J. W. (1977). Woes of regression coefficients. In Data analysis and regression: A second course in statistics. Pearson.
Pollatsek, A., Perea, M., & Binder, K. S. (1999). The effects of “neighborhood size” in reading and lexical decision. Journal of Experimental Psychology: Human Perception and Performance, 25.
Siegelman, N., Schroeder, S., Acartürk, C., Ahn, H. D., Alexeeva, S., Amenta, S., … Kuperman, V. (2022). Expanding horizons of cross-linguistic research on reading: The Multilingual Eye-movement Corpus (MECO). Behavior Research Methods, 54(6), 2843–2863.
Siegelman, N., Elgort, I., Brysbaert, M., Agrawal, N., Amenta, S., Arsenijević Mijalković, J., … Kuperman, V. (2024). Rethinking first language–second language similarities and differences in English proficiency: Insights from the ENglish Reading Online (ENRO) project. Language Learning, 74(1), 249–294. https://doi.org/10.1111/lang.12586
Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation rates in Yes/No vocabulary tests change with Japanese university students’ English ability levels? Language Testing, 29.
Sulpizio, S., Günther, F., Badan, L., Basclain, B., Brysbaert, M., Chan, Y. L., … Marelli, M. (2024). Taboo language across the globe: A multi-lab study. Behavior Research Methods, 1–20.
Sze, W. P., Rickard Liow, S. J., & Yap, M. J. (2014). The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behavior Research Methods, 46, 263–273. https://doi.org/10.3758/s13428-013-0355-9
Tse, C. S., Yap, M. J., Chan, Y. L., Sze, W. P., Shaoul, C., & Lin, D. (2017). The Chinese Lexicon Project: A megastudy of lexical decision performance for 25,000+ traditional Chinese two-character compound words. Behavior Research Methods, 49, 1503–1519. https://doi.org/10.3758/s13428-016-0810-5
Tse, C. S., Chan, Y. L., Yap, M. J., & Tsang, H. C. (2022). The Chinese Lexicon Project II: A megastudy of speeded naming performance for 25,000+ traditional Chinese two-character words. Behavior Research Methods, 1–21.
Vergallito, A., Petilli, M. A., & Marelli, M. (2020). Perceptual modality norms for 1,121 Italian words: A comparison with concreteness and imageability scores and an analysis of their impact in word processing tasks. Behavior Research Methods, 52, 1599–1616. https://doi.org/10.3758/s13428-019-01337-8
Wanner, D. (1987). Clitic pronouns in Italian: A linguistic guide. Italica, 64.
Yap, M. J., Liow, S. J. R., Jalil, S. B., & Faizal, S. S. B. (2010). The Malay Lexicon Project: A database of lexical statistics for 9,592 words. Behavior Research Methods, 42, 992–1003. https://doi.org/10.3758/BRM.42.4.992
Yap, M. J., Balota, D. A., Sibley, D. E., & Ratcliff, R. (2012). Individual differences in visual word recognition: Insights from the English Lexicon Project. Journal of Experimental Psychology: Human Perception and Performance, 38.