Imageability, an important word characteristic in the psycholinguistic literature, is typically assessed by asking participants to estimate the ease with which a word can evoke a mental image. Our aim was to explore inter-rater disagreement in normative imageability ratings. We examined the predictors of variability around average imageability ratings for young, middle-aged and older adults (Study 1) and assessed its impact on visual word recognition performance in young adults (Study 2). Analyses of French age-related imageability ratings (Ballot et al., Behavior Research Methods, 54, 196–215, 2022) revealed that inter-rater disagreement around the average imageability value was critically high for most words within the imageability norms, thus questioning the construct validity of the average rating for the most variable items. Variability in ratings changed between age groups (18–25, 26–39, 40–59, and over 60 years) and was associated with words that are longer, less frequent, learnt later in life and less emotional (Study 1). To examine the consequences of elevated standard deviations around the average imageability rating on visual word recognition, we entered this factor in a hierarchical regression alongside classic lexico-semantic predictors. The effect of word imageability on young adults’ lexical decision times (Ferrand et al., Behavior Research Methods, 50, 1285–1307, 2018) remained significant after accounting for inter-rater disagreement in imageability ratings, even when considering the least consensual words (Study 2). We conclude that imageability ratings reliably predict visual word recognition performance in young adults for large datasets, but might require caution for smaller ones. Given imageability rating differences across adulthood, further research investigating age-related differences in language processing is necessary.
Reading a word usually involves accessing relevant linguistic information about it, including its meaning(s). Some words may trigger a mental image of what they stand for (e.g., strawberry). The ease with which a word gives rise to a sensory mental representation in the absence of an external percept is referred to as ‘imageability’ (e.g., Paivio et al., 1968; Rofes et al., 2018). Individuals are typically faster and more accurate when responding to highly imageable words (e.g., strawberry) than to less imageable ones (e.g., entity) in visual word recognition tasks such as lexical decision (e.g., Ballot et al., 2022; Ferrand et al., 2018; Khanna & Cortese, 2021), naming (Ferrand et al., 2011; Yap, Pexman et al., 2012b), and progressive demasking (Ferrand et al., 2011; Ploetz & Yates, 2016; Yap, Pexman, et al., 2012b). The usefulness of imageability ratings in predicting how verbal material is processed extends to other cognitive domains, such as semantic classification (Yap, Pexman, et al., 2012b) and memory (e.g., Ballot et al., 2021; Khanna & Cortese, 2021; Lau et al., 2018). Furthermore, imageability typically accounts for word processing performance more effectively than other lexico-semantic variables, such as concreteness or age of acquisition (e.g., Khanna & Cortese, 2021; Su et al., 2022) across several tasks and languages (e.g., English, Balota et al., 2004; Yap, Balota, et al., 2012a; French, Ballot et al., 2022; Ferrand et al., 2018; Italian, Vergallito et al., 2020; Chinese, Su et al., 2022). Imageability ratings are thus believed to reflect a prominent dimension of word meaning and correlate strongly between languages (Rofes et al., 2018). As such, word imageability is considered crucial to the specification of theories about lexico-semantic representations and processing. 
Lexico-semantic processing refers to the cognitive processes through which readers represent, access and retrieve words and their meaning (Meteyard & Vigliocco, 2018), including sensorimotor information, from the mental lexicon. By highlighting the role of sensory experiences in language, imageability could provide a deeper understanding of how words are represented and processed in the mind, and could also make it possible to test hypotheses derived from theories such as embodiment (e.g., Barsalou, 2020).
Despite evidence of the role of imageability in visual word recognition, a stark contrast has been observed between the effect sizes reported in experimental studies (.24 to .44, e.g., Evans et al., 2012; Ploetz & Yates, 2016; Rojas et al., 2022) and in megastudies based on large-scale datasets, as the latter systematically report much smaller effect sizes (.002 to .03, e.g., Ballot et al., 2022; Dymarska et al., 2023; Khanna & Cortese, 2021). Moreover, several studies found that imageability ratings produced inconsistent effects across studies as they sometimes failed to predict accuracy and/or reaction times in lexical decision tasks (LDT) or naming tasks (for review, see Dymarska et al., 2023). These results, along with the small effect sizes reported in megastudies, led Dymarska et al. (2023) to conclude that the typical imageability effect may be considerably weaker than previously believed. These authors further suggested that other candidate variables should be prioritized over imageability when investigating lexico-semantic processing. Moreover, they proposed that imageability effects could be either magnified or suppressed depending on the inclusion of control variables, as imageability covaries with several possible confounds.
Among the various psycholinguistic variables correlated with imageability (e.g., Bonin et al., 2018; Dymarska et al., 2023), Age of Acquisition (AoA), which refers to the age at which participants believe they first learned a specific word, displays a moderate-to-strong correlation with imageability. Indeed, highly imageable words also tend to be learned earlier in life (e.g., Bird et al., 2001; Kuperman et al., 2012), with both variables being considered important predictors of word processing (e.g., Cortese & Schock, 2013). Furthermore, emotional words have been found to be more imageable than less emotional ones (Ballot et al., 2022; Westbury et al., 2013), and performance facilitation by imageability seems particularly strong in low-frequency words (Connell & Lynott, 2018). The relationship between imageability and subjective frequency ratings ranges from weak (r = .12, Westbury, 2014) to moderate (r = .33 in Ballot et al., 2022; r = .26 in Desrochers & Thompson, 2009), as words rated as less frequent are also rated as harder to imagine. However, imageability effects cannot be interpreted as an artefact of word frequency (Westbury, 2014), as they typically persist after both objective and subjective word frequencies have been taken into account (e.g., Ballot et al., 2022; Cortese & Schock, 2013). Regarding the relationship between imageability and AoA, the pattern of results is more mixed. Imageability has been shown to explain additional variance beyond AoA (e.g., Cortese & Schock, 2013), suggesting that these two predictors cannot be conflated. However, controlling for AoA can reduce the effect size for imageability (Dymarska et al., 2023), which in some cases even fails to reach significance, depending on tasks, word sets and covariates (e.g., Brysbaert et al., 2000; Ploetz & Yates, 2016).
Nevertheless, after controlling for a comprehensive panel of lexical and sensorimotor variables, results from a recent investigation of several imageability norms led authors to conclude that differences in the predictive power of imageability ratings cannot be fully accounted for by differences in lexico-semantic characteristics of the words themselves (Dymarska et al., 2023). These authors further argue that the variance in predictive power between imageability norms did not result from differences in procedure and instructions, which were identical across norming studies. Therefore, fluctuations in imageability effects may have other explanations, which could intrinsically relate to the measure of word imageability itself, and more specifically to the subjective nature of these ratings.
Inter-individual variability and imageability ratings
Word meaning is notoriously difficult to measure objectively, as it “ultimately resides within the language user’s head” (Winter, 2022, p. 492). To approximate a measure of subjective word meaning, lexico-semantic variables such as word imageability are typically assessed by norming studies in which participants are requested to rate a number of words, which are then combined into a larger dataset. Imageability ratings are usually collected using five- or seven-point scales (e.g., Ballot et al., 2022; Bird et al., 2001; Desrochers & Thompson, 2009; Paivio et al., 1968). For each word in a given dataset, individual ratings are then averaged. The resulting normative rating for a word usually comprises the mean rating and the standard deviation of participant ratings around this average value. Psycholinguistic norms such as imageability ratings help researchers to study subjective linguistic phenomena at the population level. They provide normed stimuli which can be reused between studies and are thought to minimize the impact of inter-individual subjectivity by averaging several individual responses (Winter, 2022). Imageability norms are expected to provide reliable estimations of the underlying semantic property of words, considering the high correlations between databases from the same language (e.g., Ballot et al., 2022; Schock et al., 2012), across languages (e.g., Rofes et al., 2018), and with theoretically related constructs such as concreteness (e.g., Bonin et al., 2018).
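As a minimal sketch of how such a normative entry is derived, the snippet below averages raw ratings per word and reports the sample standard deviation alongside the mean. The words, the ratings and the function name `summarize_ratings` are illustrative assumptions, not data or code from any of the cited norms.

```python
from statistics import mean, stdev

def summarize_ratings(raw):
    """Return {word: (mean, sd)} from {word: [individual ratings]}."""
    return {word: (mean(r), stdev(r)) for word, r in raw.items()}

# Hypothetical seven-point ratings from six raters per word.
raw_ratings = {
    "strawberry": [7, 7, 6, 7, 7, 6],
    "entity": [1, 2, 1, 3, 2, 1],
}

norms = summarize_ratings(raw_ratings)
for word, (m, sd) in norms.items():
    print(f"{word}: M = {m:.2f}, SD = {sd:.2f}")
```

Only the two summary values per word are usually carried forward into analyses, which is why the standard deviation is the sole trace of how much raters disagreed.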
The averaging approach, which is widely used in cognitive psychology, presupposes a measure of homogeneity in cognitive processes across participants (see Andrews, 2015). It has been increasingly challenged in the recent psycholinguistic literature, as differences in language abilities may be pervasive between individuals throughout the lifespan (Kidd et al., 2018). This statement holds for language proficiency (Andrews, 2015; Dujardin et al., 2022; Kidd et al., 2018) and might extend beyond it, as lexico-semantic processing is believed to undergo significant changes between early and late adulthood, at both the behavioural (e.g., Krethlow et al., 2020) and neural levels (e.g., Hoffman, 2018). According to Ballot et al. (2022), the distribution of imageability ratings is not immutable across the adult lifespan. As people get older, this distribution tends towards higher imageability estimations, the difference between distributions being the most drastic when comparing adults aged 18 to 25 with adults aged 60 and over. In Ballot et al.’s (2022) database, these age-related differences in imageability ratings concerned more than one-third of the words. Performance in the LDT of the Megalex megastudy (Ferrand et al., 2018) was predicted better by imageability ratings from young adults than by ratings from older adults (Ballot et al., 2022), thus matching the population from which behavioural data had been collected. Moreover, the relationship between imageability and other major psycholinguistic variables also shows age-related modifications, as the correlation between imageability ratings and subjective frequency ratings decreased from moderate to weak as participants grew older (Ballot et al., 2022; Simonsen et al., 2013). 
Altogether, these results provide evidence of age-related changes in lexico-semantic representations (Ballot et al., 2022; Kidd et al., 2018; Rofes et al., 2018; Simonsen et al., 2013) and underline both how crucial age-specific semantic indicators are, and the importance of studying lexico-semantic behaviours across the lifespan (Grandy et al., 2020; Krethlow et al., 2020). However, ratings by age group or for other groups than the 18 to 25-year-olds are scarce. The lack of age-related imageability ratings matching the population from which large behavioural datasets were collected might produce noisy analyses in the study of imageability effects. However, the small effect sizes reported in megastudies, even when using age-adequate ratings (Ballot et al., 2022), suggest the existence of other sources of noise.
Measuring subjectivity: a recurring issue in semantic norms
Given the importance of semantic norms in the investigation and theory-building surrounding cognitive processes, and the sometimes-conflicting results found in the literature on memory, Pollock (2018) performed an analysis of concreteness norms in several English databases and highlighted important methodological and statistical issues typical of how semantic variables are usually measured. Focusing on concreteness ratings from Brysbaert et al.’s (2014) database, Pollock (2018) found that for most words with intermediate mean concreteness values (around three on a five-point scale), the standard deviation was well above 1 and even as far as 4.5 SDs. Intermediate average values with such elevated standard deviations indicate that half of the participants rated the same word as abstract, while the other half rated it as concrete (Pollock, 2018; see also Winter, 2022). This startling observation raises the question of construct validity for words on which inter-rater disagreement is high when performing semantic ratings. Moreover, when plotting the distribution of average ratings pooled from two English imageability databases (Cortese & Fugett, 2004; Schock et al., 2012) as a function of the standard deviation of these ratings, Pollock (2018) found that imageability followed a curvilinear distribution similar to that of concreteness, with middle-range ratings being the least consensual between participants (see also Paisios et al., 2023 for similar results on Body Object Interaction (BOI) ratings, which refer to the ease or difficulty with which a human body can physically interact with a given object).
These findings have important conceptual implications regarding the interpretability of average ratings: if, for the same word, participants can give conceptually opposed answers (e.g., abstract/concrete, positive/negative, imageable/non-imageable), then the average value cannot be used as reflecting a theoretically relevant dimension of the word, but rather reflects inter-rater disagreement. The possible causes of such disagreement, whether related to the words themselves, to participant characteristics or to an interaction of both, are yet to be investigated to understand how these average ratings should be interpreted. High levels of inter-rater disagreement might result from participants using various criteria at the time of judgement, or from between-participant differences in their familiarity with the words under consideration (Pollock, 2018). However, inter-rater disagreement is yet to be investigated with regard to predicting variability in imageability ratings.
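To make the interpretive problem concrete, the sketch below compares two hypothetical words that receive the same mid-scale mean on a seven-point scale, one through consensus and one through a split between the two ends of the scale. The ratings are invented for illustration only.

```python
from statistics import mean, stdev

# Hypothetical ratings from ten raters on a seven-point scale.
consensual = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]  # every rater picks the midpoint
polarized = [1, 1, 1, 1, 1, 7, 7, 7, 7, 7]   # half pick each end of the scale

for label, ratings in (("consensual", consensual), ("polarized", polarized)):
    print(f"{label}: M = {mean(ratings):.2f}, SD = {stdev(ratings):.2f}")
```

Both words would enter a database with a mean of 4, yet only the standard deviation reveals that, for the second word, the mean describes no individual rater.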
The present research
The purpose of the present research was to examine whether word imageability ratings vary within and across age groups, and the possible implications regarding visual word recognition. The aim of Study 1 was to explore the distribution of word imageability ratings and their variability across adulthood, in order to determine lexico-semantic predictors of inter-participant disagreement by taking the putative age-related changes in language representations into account. We used Ballot et al.’s (2022) database, which contains age-stratified imageability and subjective frequency ratings for the French adult population (ages 18 to over 60). The aim of Study 2 was to determine to what extent the inter-rater disagreement in imageability ratings identified in Study 1 influences the imageability effect in visual word recognition in young adults, by using LDT data from the Megalex database (Ferrand et al., 2018).
Study 1
Imageability has been shown to facilitate performance in several visual recognition tasks. It may be crucial in our understanding of lexico-semantic processing, and its importance in studies using verbal material has led to sustained efforts to produce imageability norms across languages (e.g., Acar et al., 2016; Desrochers & Thompson, 2009; Su et al., 2022) and populations (e.g., Ballot et al., 2022; Grandy et al., 2020; Simonsen et al., 2013). Such ratings provide normed stimuli which are thought to minimize the impact of inter-individual subjectivity by averaging a large number of individual responses (Winter, 2022). However, although several studies suggest that lexico-semantic processing changes over the adult lifespan (e.g., Ballot et al., 2022; Wulff et al., 2022), lexical databases rarely provide age-inclusive or age-stratified data, thus hindering the in-depth investigation of semantic effects. Moreover, middle-scale ratings for words have been shown to emerge from high inter-rater disagreement across several semantic norms (Paisios et al., 2023; Pollock, 2018). Major disagreement between raters compromises the interpretation of averaged ratings for a large set of words provided in databases (e.g., Pollock, 2018). As it pertains to imageability, a similar pattern of rating could be reflected by participant performance in visual word recognition and provide new insight regarding the interpretation of the literature on imageability effects. However, the properties of words which could induce such disagreement are currently unknown, and previously published work did not provide any direct investigation as to the sources underlying these rating behaviours.
This study focused on: a) exploring the distribution of imageability ratings and their inter-rater variability using a descriptive approach on a French lexical database; and b) identifying predictors of variability within and across adult age groups. We investigated word-level predictors of variability in imageability ratings, as indexed by standard deviations around the average distribution of imageability ratings, using an age-related database (Ballot et al., 2022). Based on previous research reporting that imageability ratings could be predicted from other word characteristics using linear regressions (Reilly & Kean, 2007; Westbury, 2014), we aimed to determine if the distribution of inter-rater disagreement in imageability ratings was biased towards some types of words. Considering the overall age-related changes in semantic organization (e.g., Wulff et al., 2022), the shift in imageability ratings towards the higher end of the scale in older adults and the age-related modification of correlations between imageability and subjective frequency (e.g., Ballot et al., 2022) reported in the literature, we aimed to determine word-level predictors of standard deviation in imageability ratings for each age group of Ballot et al.’s database (ages 18–25, 26–39, 40–59, and over 60) in order to establish age-related consistencies and changes. Based on similar endeavours for concreteness (Pollock, 2018) and BOI ratings (Paisios et al., 2023), we expected a quadratic relationship between the average imageability rating of words and the standard deviation of the ratings, where middle-range ratings would be associated with systematically high disagreement levels between participants. Moreover, words with a lower subjective frequency (i.e., judged as infrequently encountered by participants) were expected to give rise to more inter-rater variability in imageability ratings. 
Finally, predictors of variability in imageability were expected to differ across age groups, as we believed subjective frequency to be a stronger predictor of inter-rater variability in imageability ratings for the youngest age groups.
Method
Materials
We retrieved imageability, subjective frequency, valence and arousal ratings from Ballot et al.’s (2022; see Gobin et al., 2017 for emotional variables) database, which contains age-related norms for 1,286 French words for the following age groups: 18–25, 26–39, 40–59, and over 60. Correlations between imageability ratings by age group are reported in Table 5. Both imageability and subjective frequency for each word were assessed on a seven-point rating scale by at least 30 participants in each of the four age groups. Averaged and raw data were made publicly available by Ballot et al. (2022) at https://osf.io/vhmub/?view_only=a4f2e3ecf68e4c669fecd28e5464e989.
Objective lexical characteristics, such as book and web word frequencies, word length and orthographic neighbourhood, were extracted from Lexique 3.83 (New et al., 2004).
However, as there were no AoA norms available for the majority of the words from Ballot et al. (2022), we collected new AoA ratings for this dataset. We recruited 469 French-speaking participants through mailing lists. Surveys were administered via the PsyToolkit platform (Stoet, 2010, 2017). After giving informed consent, each participant was randomly assigned one of four lists of words (321 or 322 words per list). Using instructions adapted from Ferrand et al. (2008), participants indicated the age (in years) at which they believed they had first learned each known word1, before answering demographic questions. Participants were not compensated for their participation. We excluded participants who were not French-speaking or who had not completed the whole survey, resulting in a final sample of 172 participants (age M = 23.32, SD = 5.61; years of education M = 15.69, SD = 6.45). Each of the 1,286 words was rated by a total of 38 to 51 participants. To assess the validity of our ratings, we examined their correlation with the AoA norms from Ferrand et al. (2008). For the 152 words common to both datasets, the correlation between AoA ratings was very strong, r = .93, p < .01, providing evidence in favour of the validity of our newly collected AoA norms.
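The validity check described above amounts to restricting both sets of norms to their shared words and correlating the mean ratings. A minimal sketch, assuming two hypothetical dictionaries of mean AoA ratings (the words and values below are invented and do not come from either dataset):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical mean AoA ratings (in years) from two norming studies.
new_norms = {"chat": 3.1, "maison": 3.4, "impôt": 11.2, "hypothèse": 13.0, "vélo": 4.0}
old_norms = {"chat": 2.8, "maison": 3.9, "impôt": 10.5, "vélo": 4.4, "orage": 5.1}

# Only words present in both datasets can contribute to the validity check.
common = sorted(set(new_norms) & set(old_norms))
r = pearson_r([new_norms[w] for w in common], [old_norms[w] for w in common])
print(f"{len(common)} common words, r = {r:.2f}")
```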
Datasets were combined using the R tidyverse package (R Core Team, 2020; Wickham et al., 2019). The variables used as predictors in our analysis are described in Table 1. Correlations between all variables entered in the regression model are reported for each of the four age groups in the Appendix (Tables 7, 8, 9, and 10).
Table 1. Mean word characteristics (SD between parentheses) for the words of Ballot et al.’s (2022) database (N=1286)
Variables | 18–25 years old | 26–39 years old | 40–59 years old | Over 60 years old | Average |
|---|---|---|---|---|---|
Subjective ratings | |||||
Mean Imageability | 4.53 (1.45) | 4.77 (1.34) | 4.98 (1.18) | 5.26 (1.07) | 4.89 (1.11) |
SD of Imageability | 1.82 (0.54) | 1.94 (0.55) | 1.98 (0.51) | 1.83 (0.55) | 1.89 (0.54) |
Subjective frequency | 2.85 (1.00) | 2.73 (0.97) | 2.94 (0.98) | 3.04 (0.91) | 2.89 (0.97) |
SD of Subjective frequency | 1.19 (0.30) | 1.28 (0.31) | 1.34 (0.29) | 1.30 (0.26) | 1.28 (0.29) |
Valence | −0.09 (1.09) | −0.12 (1.14) | −0.13 (1.18) | −0.17 (1.21) | −0.13 (1.16) |
Arousal | 2.96 (0.80) | 3.00 (0.98) | 3.28 (0.88) | 3.29 (0.92) | 3.13 (0.90) |
AoA | 9.00 (2.84) | ||||
Corpus-based characteristics | |||||
Number of letters | 5.88 (0.79) | ||||
Number of syllables | 1.96 (0.60) | ||||
Book frequency (freqlivres) | 13.11 (39.76) | ||||
Web frequency (Wordlex) | 13.07 (48.43) | ||||
Number of orthographic neighbours | 3.11 (2.61) | ||||
Note: Imageability and subjective frequency ratings were taken from Ballot et al. (2022), arousal and valence ratings were taken from Gobin et al. (2017), and AoA ratings were collected for the present study. Imageability, subjective frequency and arousal were rated on seven-point scales. Valence was rated on a scale ranging from −3 to +3. For the variable SD of Subjective frequency, the number between parentheses reflects the extent to which inter-rater disagreement on subjective frequency ratings (SD for a given word) is homogeneous (low global SD), i.e., affects all words in the dataset similarly, or heterogeneous (high global SD). Corpus-based characteristics were taken from Lexique 3.83 (New et al., 2004).
Design and analysis
We conducted a hierarchical regression analysis using the standard deviation in imageability ratings as our dependent variable. For each age group, we entered traditional objective lexical factors at Step 1: number of letters, number of syllables, book and web-based corpus frequencies, and number of orthographic neighbours. At Step 2, we included valence and arousal given their relationship with imageability ratings (Ballot et al., 2022; Kuperman et al., 2012). We then included subjective frequency at Step 3, as we expected this subjective factor to be the strongest predictor of variability in imageability ratings and thus wanted to test whether the associated change in R2 was significant. At the same step, we also included the standard deviation of subjective frequency, as we believe this variable can be used as a proxy for inter-individual differences in participants’ familiarity with the stimuli. Given the moderate correlation between subjective frequency and word imageability ratings (see Ballot et al., 2022), the SD of subjective frequency could theoretically explain part of the disagreement in imageability ratings2. Finally, at Step 4, we entered mean ratings of AoA, a factor often associated with imageability, as well as linear and quadratic terms for imageability ratings to take the shape of the distribution into account (see Fig. 1). We report adjusted R2 to correct for the number of predictors entered into the model (for a comparable procedure, see Ballot et al., 2022; Balota et al., 2004).
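The stepwise logic can be sketched as follows: predictors are added block by block, and the adjusted R2 of each step is compared with that of the preceding step. The analyses reported here were run in R; the snippet is a Python illustration on simulated data with hypothetical variable names, uses only two blocks, and does not reproduce the actual model or its results.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def adjusted_r2(predictors, y):
    """Adjusted R^2 of an OLS fit (with intercept) of y on the given columns."""
    n, k = len(y), len(predictors)
    cols = [[1.0] * n] + predictors
    # Normal equations: (X'X) beta = X'y
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(A, rhs)
    fitted = [sum(bj * col[i] for bj, col in zip(beta, cols)) for i in range(n)]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Simulated word-level data (hypothetical; not the authors' data).
random.seed(1)
n_words = 500
length = [random.gauss(0, 1) for _ in range(n_words)]
imag = [random.gauss(0, 1) for _ in range(n_words)]
imag_sq = [v ** 2 for v in imag]
# The SD of ratings is simulated as an inverted U of mean imageability plus noise.
sd_imag = [0.05 * l - 0.8 * q + random.gauss(0, 0.3) for l, q in zip(length, imag_sq)]

steps = [("Step 1: lexical", [length]),
         ("Step 2: + imageability (linear, quadratic)", [length, imag, imag_sq])]
previous = 0.0
results = []
for label, block in steps:
    r2 = adjusted_r2(block, sd_imag)
    results.append((label, r2, r2 - previous))
    print(f"{label}: adj. R2 = {r2:.3f}, delta R2 = {r2 - previous:.3f}")
    previous = r2
```

Because each step's predictors include all earlier blocks, the change in R2 at a given step is read as the variance explained over and above the predictors already in the model.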
Fig. 1. Scatterplot of mean imageability ratings as a function of SD of imageability ratings by age group
Results and Discussion
All data wrangling and analyses were performed using R (R Core Team, 2020) via the RStudio interface v.2023.3.1.446 (Posit Team, 2023).
Descriptive statistics on imageability ratings
Most words from Ballot et al.’s (2022) database exhibit standard deviations in imageability ratings exceeding 1 (see Fig. 1). Furthermore, 24% to 45% of the words showed SDs of 2 and above depending on the age group (see Table 6 for age-stratified raw and cumulative percentages of words per range of SD in imageability ratings). The quadratic shape of the distribution indicates that, when rating imageability, participants disagreed on the value to attribute to most words. As expected, in line with Paisios et al.’s (2023) conclusion of a “middle-scale disagreement” for BOI ratings, mean imageability ratings between 3 and 5 out of 7 points were almost exclusively associated with SDs equal to or greater than 2 (see Table 2). Therefore, the average imageability rating does not reflect the true imageability of middle-scale words. As such, these ratings cannot be used to make robust inferences about the way these words are processed without further insight. In addition to middle-range words, we found that inter-rater disagreement was also marked for words in the lower range of the imageability rating scale, for a total of about 70% of words with high SDs in imageability ratings. Altogether, our results suggest that the facilitatory imageability effect on word processing reported across the literature might instead result from the level of inter-rater disagreement, or from whichever confounding variable drives inter-rater disagreement in imageability ratings.
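Part of this inverted-U shape follows from the bounded scale itself: on a scale running from a to b, a set of ratings with mean m cannot have a population standard deviation larger than sqrt((m − a)(b − m)) (the Bhatia–Davis inequality), so the attainable disagreement is largest at the midpoint and shrinks to zero at the endpoints. The sketch below tabulates this ceiling for a seven-point scale; it illustrates the arithmetic constraint only and is not an analysis of the ratings, and the function name `max_sd` is hypothetical.

```python
from math import sqrt

def max_sd(mean_rating, low=1, high=7):
    """Largest possible population SD for ratings on [low, high] with this mean."""
    return sqrt((mean_rating - low) * (high - mean_rating))

for m in (1, 2, 3, 4, 5, 6, 7):
    print(f"mean = {m}: SD cannot exceed {max_sd(m):.2f}")
```

The ceiling describes what is possible rather than what raters actually do, so it constrains the shape of the scatterplot without explaining why observed SDs approach it for some words and not others.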
Table 2. Hierarchical linear regression using SD of imageability ratings as the dependent variable, by age group
Variables | 18–25 years | 26–39 years | 40–59 years | over 60 years | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
β | R2 | ΔR2 | BIC | β | R2 | ΔR2 | BIC | β | R2 | ΔR2 | BIC | β | R2 | ΔR2 | BIC | |
Step 1 : Objective word variables | ||||||||||||||||
Number of letters | −0.02* | −0.02* | −0.02 | −0.00 | ||||||||||||
Number of syllables | −0.02* | −0.00 | −0.03** | −0.02 | ||||||||||||
Book frequency | 0.03** | 0.02 | 0.01 | −0.01 | ||||||||||||
Web frequency | 0.00 | 0.02* | 0.03* | −0.00 | ||||||||||||
Number of orthographic neighbours | −0.00 | −0.00 | −0.01 | −0.02 | ||||||||||||
.041** | 2003.49 | .06** | 2007.01 | .06** | 1842.67 | .09** | 1988.63 | |||||||||
Step 2 : Affective word variables | ||||||||||||||||
Valence | 0.01 | −0.01 | −0.01 | −0.02** | ||||||||||||
Arousal | −0.03** | −0.01 | −0.02** | −0.00 | ||||||||||||
.057** | .016*** | 1996.55 | .09** | .026*** | 1985.95 | .09** | .023*** | 1825.81 | .12** | .035** | 1954.40 | |||||
Step 3 : Subjective Frequency | ||||||||||||||||
Mean subjective frequency | −0.05** | −0.03* | −0.04* | 0.08** | ||||||||||||
Standard Deviation of subjective frequency | −0.07** | 0.01 | −0.02 | −0.01 | ||||||||||||
.13** | .073*** | 1903.58 | .16** | .071*** | 1899.03 | .14** | .054*** | 1758.55 | .13** | .014*** | 1948.21 | |||||
Step 4 : Semantic variables | ||||||||||||||||
Linear Mean imageability | −0.97** | −0.88** | −0.80** | −0.66** | ||||||||||||
Quadratic Mean imageability | −0.84** | −0.70** | −0.55** | −0.49** | ||||||||||||
Age of Acquisition (AoA) | 0.04* | 0.05** | −0.01 | 0.03* | ||||||||||||
.90** | .77*** | −746.85 | .92** | .76*** | −1028.40 | .92** | .78*** | −1103.18 | .91** | .78*** | −872.73 | |||||
Note. * p<.05 ; ** p<.01 ; *** p<.001; β = Standardized coefficient for the predictor; R2 = Adjusted R2; ΔR2 = Additional explained variance at a given regression step. For the variable SD of Subjective frequency, the number between parentheses reflects to what extent inter-rater disagreement on subjective frequency rating (SD for a given word) could be homogenous (low global SD), i.e., affecting similarly all words from the dataset, or heterogenous (high global SD). All variables were centered before analyses
Considering the large proportion of words for which participants disagree when rating imageability (see also Pollock, 2018 for similar results on concreteness), different mechanisms could give rise to a facilitatory pattern of imageability ratings on word processing, depending on whether said ratings are consensual or not. For consensual words, which are typically highly imageable, the reactivation of sensorimotor information would reliably facilitate processing for all participants. In contrast, for non-consensual words, facilitation by imageability may vary greatly due to inter-individual differences, resulting in noisy measures and reducing or suppressing any facilitatory imageability effect when averaging performance across participants. Finally, as a lack of consensus in ratings may stem from word ambiguity (see Huete-Pérez et al., 2020), a processing disadvantage for non-consensual words could emerge from semantic competition between meanings (Rodd, 2020). This would increase the contrast in performance between consensual imageable words and non-consensual and/or lower imageability words. Consequently, at this stage, one may assume that the typical processing advantage for highly imageable words could be an artefact of the level of consensus among raters, which would account for some of the reported fluctuations in imageability effects depending on the words and participants included. This assumption warrants further examination, which we undertook in Study 2.
As expected, the distribution of SDs in imageability ratings varied significantly between age groups, F(3, 5140) = 27.62, p < .0001, η2 = .02. Post hoc comparisons using the Tukey HSD test revealed that SDs in imageability ratings were lower in the 18–25 (M = 1.82; SD = 0.54) and the over-60 (M = 1.83; SD = 0.55) age groups than in the 26–39 (M = 1.94; SD = 0.55) and 40–59 (M = 1.98; SD = 0.51) age groups (p < .0001). In line with Wulff et al.’s (2022) study, young adults probably share closer sensorimotor experiences and a common linguistic background as a result of the normativity of formal schooling. The higher inter-individual variability in mid-adulthood may reflect greater heterogeneity among middle-aged adults: during this period, individuals would diverge in their personal and professional trajectories and acquire more varied yet individualized linguistic and sensorimotor experiences (Wulff et al., 2022). In older adults, the cessation of professional activities (Hommel & Kibele, 2016) and the overall breadth of lexical knowledge might reduce the impact of heterogeneous experiences from middle age. These explanations are in line with previous studies suggesting that age-specific semantic organization is crucial to predicting lexico-semantic behaviours across the lifespan (Krethlow et al., 2020; Wulff et al., 2022). Age-related differences in participant disagreement when performing imageability ratings emphasize the importance of age-inclusive ratings. They also provide further evidence of modifications in the lexico-semantic organization during the adult lifespan, in addition to those suggested by the age-related skewness of average imageability ratings demonstrated by Ballot et al. (2022). According to previous work, vocabulary, and consequently the lexico-semantic network, would increase in size until ages between 65 and 70 years (e.g., Keuleers et al., 2015).
Furthermore, the semantic networks of older adults exhibit considerably more inter-individual variability than those of younger adults, and this variability was found to change depending on individual educational level for older adults only (Wulff et al., 2022). Overall, these elements support the idea that lifelong cumulative exposure to linguistic and sensorimotor experiences contributes to individual differences in lexico-semantic processing. Finally, differences in rating behaviour over the adult lifespan might result from an interplay between a larger, more heterogenous and less connected semantic network (Wulff et al., 2022) and reduced interoceptive sensations (MacCormack et al., 2021) in older adults, which could increase the reliance on non-sensorimotor information as a form of compensation (e.g., Shafto et al., 2012).
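As a quick consistency check on the age-group comparison reported above, eta squared for a one-way ANOVA can be recovered from the F statistic and its degrees of freedom. The sketch below is illustrative only and is not taken from the authors' analysis scripts; the function name is hypothetical.

```python
def eta_squared_from_f(f_value: float, df_between: int, df_within: int) -> float:
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    # F = (SS_between / df_between) / (SS_within / df_within), hence
    # SS_between / SS_within = F * df_between / df_within.
    numerator = f_value * df_between
    return numerator / (numerator + df_within)


# Values reported in the text: F(3, 5140) = 27.62
eta2 = eta_squared_from_f(27.62, 3, 5140)
print(round(eta2, 3))  # about .016, i.e. .02 at two decimals
```

The result agrees with the reported effect size once rounded to two decimals, which indicates that age group accounts for a small share of the variance in SDs despite the highly significant test.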
Lexical predictors of disagreement (SD) in imageability ratings
As reported in Table 2, longer words and words with higher objective frequencies produced more disagreement in imageability ratings than shorter and less frequent words³, in all age groups except the oldest. The effect of word length is consistent with Reilly and Kean’s (2007) finding that this factor predicted average imageability ratings. These authors reported that longer words were also less concrete, more complex and acquired later in life (see also Bird et al., 2001). Such characteristics of longer words would explain increased variability in imageability ratings, depending on the linguistic experience of participants. The effect of objective frequency on the level of disagreement might be explained by the fact that frequent words can occur in a wider range of contexts (e.g., Adelman et al., 2006). Contextual diversity is associated with increased semantic ambiguity (Hoffman et al., 2013), which has been shown to result in averaged ratings failing to reflect the different meanings associated with some words (e.g., Huete-Pérez et al., 2020 for word emotionality). Altogether, objective lexical factors did not predict more than 9% of variance, suggesting that inter-rater disagreement in imageability ratings might be due to more subjective factors (Step 1).
High disagreement levels in imageability ratings were also associated with lower emotional content of the words (for all age groups except the 26–39-year-olds). This finding seems consistent with studies reporting that although they refer to mostly intangible concepts, emotional words are generally easier to simulate owing to their interoceptive dimension (Abbassi et al., 2015; Connell et al., 2018) (Step 2).
Furthermore, as predicted, the average subjective frequency of words predicted the SD of imageability ratings over and above the predictors entered at previous steps (Step 3). Across all age groups, words perceived by participants as being more frequent generated less inter-rater disagreement than those perceived as less frequently encountered. The difference in direction between the effects of subjective and objective frequency on participants’ performance, along with the higher explanatory power reported for subjective frequency in previous studies (e.g., Ballot et al., 2022), may be interpreted as further confirmation that subjective frequency is an important indicator of participants’ lexico-semantic organization and that both types of indicators should be used jointly. Unexpectedly, in the 18–25 group only, disagreement in imageability ratings was lower for words with higher variability in subjective frequency ratings. However, this result probably originates from the strong positive correlation between the SD and mean of subjective frequency ratings (r=.63 to .70, p<.01), as participants might have difficulties in judging gradations between frequently and very frequently encountered words (4/7 = once a week; 5/7 = once every two days; 6/7 = once a day; 7/7 = several times a day) while agreeing more easily on the rate of occurrence for rarer ones (1/7 = never encountered; 2/7 = once a year; 3/7 = once a month).
Finally, words with a higher AoA, that is, words that were learned later in life, were associated with higher levels of inter-rater disagreement in imageability ratings (across all age groups except the 40–59-year-olds). Before the addition of mean imageability terms, the R2 of the models ranged between .13 and .16 depending on the age group. This suggests that in addition to the lexico-semantic and emotional characteristics considered in this study, other predictors (which may relate to the participants themselves) should be investigated in future research to determine what causes a word to induce disagreement about its imageability value, and for which participants. The addition of linear and squared mean terms for imageability ratings accounted for 77% to 90% of additional variance, further emphasizing the extent to which average imageability ratings provided in lexical databases are driven by the underlying dispersion of individual ratings. It is therefore important to take SDs of imageability ratings into account when using verbal material in experiments (Step 4).
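One reason why linear and squared mean terms can capture so much of the variance in SDs is that, on a bounded scale, the mean mathematically limits how large the SD can be. By the Bhatia–Davis inequality, the population variance of ratings on a 1–7 scale cannot exceed (7 − mean) × (mean − 1), which is itself a quadratic function of the mean. The sketch below illustrates this ceiling only; it is not part of the authors' regression models, and the function name is hypothetical.

```python
import math

SCALE_MIN, SCALE_MAX = 1, 7  # 7-point rating scale


def max_sd_for_mean(mean: float) -> float:
    """Largest possible (population) SD of ratings with the given mean.

    Follows from the Bhatia-Davis inequality:
    variance <= (max - mean) * (mean - min).
    """
    if not SCALE_MIN <= mean <= SCALE_MAX:
        raise ValueError("mean must lie within the rating scale")
    return math.sqrt((SCALE_MAX - mean) * (mean - SCALE_MIN))


for m in (1, 2, 4, 6, 7):
    print(m, round(max_sd_for_mean(m), 2))
```

The ceiling is zero at both ends of the scale and peaks at the midpoint (an SD of 3 for a mean of 4). This constraint alone does not explain why observed SDs come close to the ceiling for middle-range words, which is the empirical question at stake.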
Study 2: Virtual Experiment
Results from Study 1 show that for a large set of words in the imageability database (Ballot et al., 2022), standard deviations of mean imageability ratings were well above 1 SD (71% to 78% of words depending on raters' age). Other imageability databases exhibit a similar pattern (more than 50% of words above 1 SD in the aggregated Cortese & Fugett, 2004, and Schock et al., 2012, imageability norms; see Pollock, 2018). High inter-rater disagreement in semantic ratings has direct implications for the results of experimental studies using word materials and their theoretical interpretations (Pollock, 2018; Winter, 2022). When some participants rate a word on the lower end of the imageability scale while others rate it on the higher end, the validity of the average value and its ability to predict performance are called into question. When replicating concreteness-based memory experiments, Pollock (2018) found that the classic facilitation of memory performance by word concreteness failed to replicate in two out of three studies when controlling for SDs of concreteness ratings. However, the impact of SD in imageability ratings on imageability effects in word processing remains unknown, even though the typical imageability effect could end up being a disagreement effect in disguise.
The main aim of Study 2 was therefore to investigate the influence of high SDs in imageability ratings on visual word recognition performance in young adults, using behavioural data from the Megalex megastudy (Ferrand et al., 2018) and imageability ratings from the same database as in Study 1 (Ballot et al., 2022). We expected that in addition to the classic predictors of visual word recognition performance (see Ballot et al., 2022), SDs of imageability ratings might explain part of the variance in performance in the LDT. We also expected the facilitatory effect of imageability to be all the more reduced as SDs in imageability ratings were high. The second aim of this study was to focus on words exhibiting the highest disagreement levels in imageability ratings, as identified in Study 1, to determine how participants process these specific words. We hypothesized that SDs of imageability ratings should be a better predictor of performance in the LDT than mean imageability ratings for the subset comprising the least consensual words.
Method
Materials
We crossed the set of 1,286 words described in Study 1 with response times (RTs) and accuracy from the visual LDT of the Megalex database (Ferrand et al., 2018), which resulted in a sample of 882 words for which all data was available.
Design and analysis
In the Megalex study, participants’ ages ranged between 17 and 52 years old (mean = 25.76, SD = 5.71). Most participants were therefore between 20 and 30 years old (Ferrand et al., 2018). We used the subjective ratings most closely matching this age bracket, i.e., an average of the ratings from the 18–25 and the 26–39 groups. We used a three-step hierarchical approach for the item-level analyses to specify the contributions of standard deviations of imageability ratings in comparison with average imageability ratings and other lexical factors known to influence visual word recognition. Given the strong quadratic relationship demonstrated in Study 1 between average ratings and standard deviations for imageability ratings, we included an interaction term between those two predictors in our model. We report adjusted R2 to take the number of predictors into account. In Step 1, we entered the objective lexical variables retrieved from the most recent version of Lexique (v 3.83): number of letters, number of syllables, book- and web-based (blog, Twitter and newspaper) objective frequencies, and number of orthographic neighbours. Objective frequency indicators were transformed into Zipf scores (Brysbaert et al., 2018). In Step 2, we entered subjective variables of interest (for a similar procedure, see Ballot et al., 2022): mean AoA, mean and SD for imageability and subjective frequency, as well as the interaction term between mean and SD of imageability. In the last step, we added affective variables, i.e., valence and arousal, as they are known to influence visual word recognition performance (e.g., Kuperman et al., 2014).
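The predictor preparation described above can be sketched as follows. The Zipf scale is the base-10 logarithm of the frequency per million words plus 3; centring subtracts the mean from each predictor. The item values, the function names, and the choice to build the interaction as the product of the two centred predictors are assumptions made for illustration, not details confirmed by the source.

```python
import math
from statistics import fmean


def zipf_score(freq_per_million: float) -> float:
    """Zipf scale: log10 of the frequency per million words, plus 3."""
    if freq_per_million <= 0:
        raise ValueError("frequency must be positive (zero counts need smoothing first)")
    return math.log10(freq_per_million) + 3


def center(values):
    """Subtract the mean so that the predictor has a mean of zero."""
    mean = fmean(values)
    return [v - mean for v in values]


# Hypothetical item-level values (not taken from the norms)
imag_mean = [6.5, 4.0, 2.0, 5.5]
imag_sd = [0.8, 2.3, 1.4, 1.6]

imag_mean_c = center(imag_mean)
imag_sd_c = center(imag_sd)
# Assumed construction of the interaction term: product of centred predictors
interaction = [m * s for m, s in zip(imag_mean_c, imag_sd_c)]

print(zipf_score(1.0))    # 1 occurrence per million words -> 3.0
print(zipf_score(100.0))  # 100 occurrences per million words -> 5.0
```

Centring before forming a product term is a common way of reducing the correlation between an interaction term and its component predictors, which matters here because mean and SD of imageability are themselves strongly related.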
Results and discussion
Predictors of performance in the LDT for all 882 words
Correlations between all variables entered in the model are reported in supplementary Table 12. When considering the whole subset of 882 words for which both imageability (Ballot et al., 2022) and visual LDT data (Ferrand et al., 2018) were available, we found that objective predictors of visual word recognition accounted for a total of 47% of the variance in RTs and 31% of the variance in accuracy in the LDT (p<.01) (Step 1). Beyond these, subjective lexical variables accounted for an additional 11% of variance in RTs and 10% in accuracy (p<.001) (Step 2). As we expected, mean subjective frequency facilitated both RTs and accuracy (r2<.01, p<.01), as did AoA (r2=.02, p<.01). Additionally, performance in the LDT was facilitated by the SD of subjective frequency for both RTs and accuracy (r2<.01). More importantly, the facilitatory effect of mean imageability ratings on both RTs and accuracy in the LDT remained significant when accounting for SDs of imageability ratings (r2<.01), despite the large proportion of words for which these SDs were above 2 (see Table 3). Contrary to our expectations, SDs of imageability ratings did not predict LDT performance (p=.25 for RTs; p=.26 for accuracy), nor did they modify the effects of mean imageability (p=.20 for RTs; p=.62 for accuracy).
Table 3. Hierarchical linear regression using RTs and accuracy in the LDT (Megalex) as the dependent variables (N= 882 words)
| Variables | RTs: β | RTs: R2 | RTs: ΔR2 | RTs: BIC | Accuracy: β | Accuracy: R2 | Accuracy: ΔR2 | Accuracy: BIC |
|---|---|---|---|---|---|---|---|---|
| Step 1: Objective variables | | | | | | | | |
| Number of letters | .013 | | | | .13* | | | |
| Number of syllables | .05 | | | | -.014 | | | |
| Book frequency | -.08* | | | | -.02 | | | |
| Web frequency | -.34** | | | | .33** | | | |
| Number of orthographic neighbours | -.017 | | | | .023 | | | |
| Model at Step 1 | | .47** | | 8858.47 | | .31** | | −2288.69 |
| Step 2: Subjective variables | | | | | | | | |
| M subjective frequency | -.13** | | | | .076 | | | |
| SD of subjective frequency | -.061* | | | | .12** | | | |
| M imageability | -.11* | | | | .01* | | | |
| SD of imageability | -.071 | | | | .082 | | | |
| M imageability x SD of imageability | .048 | | | | -.021 | | | |
| Age of Acquisition (AoA) | .25** | | | | -.21** | | | |
| Model at Step 2 | | .58** | .11*** | 8692.75 | | .41** | .10*** | −2395.16 |
| Step 3: Affective variables | | | | | | | | |
| Word valence | .017 | | | | -.003 | | | |
| Word arousal | -.046* | | | | .086** | | | |
| Model at Step 3 | | .59** | .007** | 8692.40 | | .42** | .010*** | −2391.79 |
Note. * p<.05 ; ** p<.01 ; *** p<.001; β = Standardized coefficient for the predictor; R2 = Adjusted R2; ΔR2 = Additional explained variance at a given regression step; M= Mean; SD = Standard Deviation. For the variable SD of Subjective frequency, the number between parentheses reflects to what extent inter-rater disagreement on subjective frequency rating (SD for a given word) could be homogenous (low global SD), i.e., affecting similarly all words from the dataset, or heterogenous (high global SD). All predictors were centered before analyses
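To make the BIC column of Table 3 easier to read, the change in BIC between consecutive steps can be computed directly from the tabulated values. The sketch below is an arithmetic aid only, not a re-analysis; the helper name is hypothetical, and it relies on the usual convention that lower BIC indicates a better trade-off between fit and number of predictors.

```python
# BIC values copied from Table 3
bic = {
    "RTs": {"Step 1": 8858.47, "Step 2": 8692.75, "Step 3": 8692.40},
    "Accuracy": {"Step 1": -2288.69, "Step 2": -2395.16, "Step 3": -2391.79},
}


def bic_changes(steps: dict) -> dict:
    """Change in BIC from each step to the next (negative = improvement)."""
    names = list(steps)
    return {
        f"{prev} -> {curr}": round(steps[curr] - steps[prev], 2)
        for prev, curr in zip(names, names[1:])
    }


for outcome, steps in bic.items():
    print(outcome, bic_changes(steps))
```

On these figures, adding the subjective variables at Step 2 lowers BIC substantially for both outcomes (by about 166 for RTs and 106 for accuracy), whereas adding the affective variables at Step 3 changes it very little (−0.35 for RTs, +3.37 for accuracy).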
These results suggest that, reassuringly, inter-rater disagreement in imageability ratings does not interfere with the facilitatory effect associated with high imageability in the LDT, at least at the group level, although it might introduce some noise in the measurement of performance. Beyond the confirmation of the usefulness of imageability ratings, these findings raise further questions. It seems likely that participants cannot reliably evaluate the intensity of their mental imagery (Connell & Lynott, 2016) when performing imageability judgements, unless the words are highly imageable, giving rise to an unambiguously strong sensorimotor simulation, or unless the words do not reactivate any sensorimotor information, in which case they will always be rated as non-imageable. With the exception of those two extreme cases, it is possible that a facilitatory sensorimotor reactivation still occurs during word processing, but that participants are unreliable in perceiving and reporting it due to inter-individual differences in interoceptive abilities.
Finally, words rated as more arousing were associated with shorter RTs (r2<.01) and improved accuracy (r2=.01) (Step3), in line with previous studies (e.g., Ballot et al., 2022; Citron et al., 2014; Kuperman et al., 2014; Siakaluk et al., 2016).
Predictors of LDT performance for 649 words with non-consensual imageability ratings (SDs over 1.5)
When considering a subset of 649 words for which imageability ratings were not consensual (see Table 4), we found that, as for the whole set of 882 words, objective predictors of visual word recognition accounted for a total of 47% of the variance in RTs and 32% of the variance in accuracy in the LDT (p<.001) (Step 1), while subjective lexical variables accounted for an additional 10% of the variance for both RTs and accuracy (p<.001) (Step 2). The effect of all the predictors remained constant across both word sets, with the exception of SD of subjective frequency, which still predicted response accuracy (r2=.01, p<.001) but not RTs (p=.08) in the LDT. In contrast with our predictions, but in line with our findings on the whole set of words, the facilitatory effect of mean imageability remained significant for both RTs (r2<.01, p<.05) and accuracy (r2=.01, p<.01), while SDs of imageability ratings did not predict LDT performance for this subset of words (p=.66 for RTs; p=.98 for accuracy). These findings further support our conclusions regarding the robustness of the imageability effects and confirm that inter-rater disagreement does not interact with the facilitatory effect of mean imageability within our dataset. Finally, as for the whole subset of 882 words, RTs in the LDT were facilitated for more arousing words, and the whole model accounted for 59% of the variance in RTs and 42% of the variance in response accuracy (Step 3).
Table 4. Hierarchical linear regression using RTs and accuracy in the LDT as the dependent variables for words above 1.5 SD in imageability ratings (N=649)
| Variables | RTs: β | RTs: R2 | RTs: ΔR2 | RTs: BIC | Accuracy: β | Accuracy: R2 | Accuracy: ΔR2 | Accuracy: BIC |
|---|---|---|---|---|---|---|---|---|
| Step 1: Objective variables | | | | | | | | |
| Number of letters | -.013 | | | | .10** | | | |
| Number of syllables | .060 | | | | .031 | | | |
| Book frequency | -.047 | | | | -.031 | | | |
| Web frequency | -.032** | | | | .32** | | | |
| Number of orthographic neighbours | -.037 | | | | .030 | | | |
| Model at Step 1 | | .47** | | 6568.69 | | .32** | | −1598.27 |
| Step 2: Subjective variables | | | | | | | | |
| M subjective frequency | -.20** | | | | .13* | | | |
| SD of subjective frequency | -.061 | | | | .12** | | | |
| M imageability | -.090* | | | | .16** | | | |
| SD of imageability | -.038 | | | | .042 | | | |
| M imageability x SD of imageability | -.010 | | | | .027 | | | |
| Age of Acquisition (AoA) | .22** | | | | -.16** | | | |
| Model at Step 2 | | .57** | .10*** | 6465.12 | | .42** | .10*** | −1662.08 |
| Step 3: Affective variables | | | | | | | | |
| Word valence | .024 | | | | -.021 | | | |
| Word arousal | -.057* | | | | .092** | | | |
| Model at Step 3 | | .58** | .005** | 6463.06 | | .43** | .009** | −1656.44 |
Note. * p<.05 ; ** p<.01 ; *** p<.001; β = Standardized coefficient for the predictor; R2 = Adjusted R2; ΔR2 = Additional explained variance at a given regression step; M= Mean; SD = Standard Deviation. For the variable SD of Subjective frequency, the number between parentheses reflects to what extent inter-rater disagreement on subjective frequency rating (SD for a given word) could be homogenous (low global SD), i.e., affecting similarly all words from the dataset, or heterogenous (high global SD). All predictors were centered before analyses
To evaluate the generalizability of predictors associated with small effect sizes in our regression models, we conducted a tenfold cross-validation⁴ on each full model using the caret package (Kuhn, 2008). These analyses mitigate the risks of overfitting and spurious correlations by repeatedly fitting a statistical model on randomized partitions of a dataset and evaluating it on the held-out partition. Results from the cross-validations confirmed the results from our linear regression for all predictors. More specifically, these analyses confirmed that mean imageability is indeed a stable predictor of performance in the LDT for both RTs (regression coefficient = −4.14) and accuracy of response (regression coefficient = .01), as were mean subjective frequency (regression coefficients = −7.04 for RTs; .01 for accuracy) and SDs of subjective frequency (regression coefficients = −13.5 for RTs; .04 for accuracy). In terms of importance for the final model, mean imageability ranked 7th out of 13 predictors, mean subjective frequency 3rd, and SD of subjective frequency 4th. Although the facilitation of LDT performance by SDs of subjective frequency is somewhat surprising, this result can be explained by the strong positive correlation between mean subjective frequency ratings and SDs of subjective frequency (r=.61, p<.01), which we discussed in Study 1. Altogether, these results provide further evidence of the robustness of the imageability effect and confirm the usefulness of imageability ratings in psycholinguistic investigations.
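The logic of tenfold cross-validation can be sketched as follows. This is a deliberately simplified illustration and not the authors' caret pipeline: it is written in Python rather than R, uses a single predictor instead of the full 13-predictor models, and runs on simulated data whose values and parameters are hypothetical.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def kfold_rmse(x, y, k=10, seed=0):
    """Average out-of-fold RMSE over k randomly assigned folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal partitions
    rmses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the k-1 training folds only
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        # Evaluate on the fold that the model has not seen
        sq_err = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out]
        rmses.append(fmean(sq_err) ** 0.5)
    return fmean(rmses)


# Simulated data (hypothetical): RT drops by 4 ms per imageability point
rng = random.Random(1)
imageability = [rng.uniform(1, 7) for _ in range(200)]
rt = [650 - 4 * v + rng.gauss(0, 20) for v in imageability]

print(round(kfold_rmse(imageability, rt), 1))
```

Because each fold is predicted by a model that never saw it, a predictor whose contribution only reflects chance patterns in the full sample will not improve out-of-fold error, which is why this procedure is informative for predictors with small effect sizes.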
General Discussion
The present paper investigated inter-rater disagreement within normative imageability ratings. We examined age-related predictors of variability around average imageability ratings and assessed the repercussions of such variability on visual word recognition performance. We found that middle-scale words displayed the most extreme inter-rater disagreement in imageability ratings and that predictors of this variability differed between age groups (Study 1). By re-examining imageability effects in visual word recognition, we showed that the facilitation of LDT performance by mean word imageability persisted even when controlling for SDs of imageability ratings, and was not moderated by inter-rater disagreement (Study 2). We hereafter discuss these results and their implications regarding lexico-semantic processing and psycholinguistic investigations.
Middle-scale disagreement
Lexico-semantic norms, especially word-imageability norms, are an omnipresent and essential tool for psycholinguistic research. They facilitate the reproducibility of materials and generalisability of results by reducing noise resulting from inter-individual subjectivity across studies. While understanding the structure of norms is crucial to our inferences about the underlying properties of words and semantic processing within the lexicon, little is known about the possible causes of disagreement in imageability ratings. In Study 1, we found that more than half of the words from a French imageability database (Ballot et al., 2022) displayed high levels of inter-rater disagreement, as was reported for large English imageability databases (see Pollock, 2018). We demonstrated that the average imageability rating for a given word is a function of standard deviation (SD) in the individual imageability ratings, i.e., of the disagreement between raters during norm collection, following a quadratic (U-shaped) distribution, in line with comparable work on concreteness (Pollock, 2018).
Interestingly, inter-rater disagreement was the highest for words from the middle range of the imageability scale. This finding is consistent with recent work on BOI ratings and middle-range values of Likert-type scales (Paisios et al., 2023), which emphasizes the question of construct validity for middle-range words. Although using Likert-style rating scales assumes a somewhat continuous underlying distribution, this assumption is questioned by the consistently high levels of disagreement observed for middle-scale imageability ratings (and to a lesser extent, for the lower end of the imageability scale below 3). With SDs equal to or well above 2, middle-range imageability ratings reflect such a high disagreement level between participants that their average rating cannot be interpreted as reflecting medium imageability at the psychological level. Such elevated inter-rater disagreement indicates that for an average value situated between three and five out of seven, the participants from the norming study systematically gave responses ranging from 1 to 7 for the same word (Pollock, 2018; Winter, 2022). The absence of words for which the average imageability rating would be consensual in the middle range of imageability scales indicates that participants do not process word imageability as a continuum of imaging ease, but rather as something that is either very difficult or very easy to access. Imageability norms collected using slider scales would likely yield similar results, as this pattern seems to be due to the participants’ conception of imageability or to the instructions they receive to perform the ratings, rather than to a characteristic of the rating scale alone.
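The point about middle-range averages can be illustrated with two sets of made-up ratings that share the same mean but differ sharply in agreement. The ratings below are hypothetical and are not drawn from the norms.

```python
from statistics import mean, stdev

# Hypothetical ratings on a 1-7 scale from ten raters (not real norm data)
consensual_mid = [4, 4, 3, 5, 4, 4, 5, 3, 4, 4]  # raters agree on "medium"
polarised = [1, 7, 1, 7, 2, 6, 1, 7, 7, 1]       # raters split between extremes

for label, ratings in (("consensual", consensual_mid), ("polarised", polarised)):
    print(label, mean(ratings), round(stdev(ratings), 2))
```

Both words would appear in a database with an average rating of 4, yet the SD is about 0.67 for the first and about 2.98 for the second. Only the first pattern corresponds to a word that raters actually experience as moderately imageable; the pattern described above for middle-range words resembles the second.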
Age and the apple(s) of discord: predictors of inter-rater disagreement
As expected, the proportion of non-consensual words differed significantly between age groups. In Study 1, we also found that middle-aged adults, a rather understudied population in psycholinguistics, were more subject to variability in their assessment of word imageability than very young (18–25) or older adults (60 and over). This result points towards more diversity in lexico-semantic processing in individuals aged 26 to 59 years old. This increased variability may originate from the divergence in professional and personal trajectories in middle-aged adults, which would give rise to increased heterogeneity in their sensorimotor and linguistic experience (Wulff et al., 2022), whereas younger adults share more homogeneous experiences due to the normativity of formal education and of the experiences leading up to early adulthood. At the other end of the adulthood spectrum, the cessation of professional activities (Hommel & Kibele, 2016) and the overall breadth of lexical knowledge might mitigate the influence of previous diversity in experience. This reasoning is in line with embodied theories of lexico-semantic processing, which propose that differences in lived experience influence linguistic processes (e.g., Barsalou, 2008). However, it cannot be excluded that increased inter-rater disagreement for middle-aged adults might also reflect sampling from a more heterogeneous population than students or active community-dwelling older adults. Further studies are necessary to investigate this issue.
Overall, words that were less emotional generated more disagreement across all age groups, except for early middle-aged adults. This result corroborates the view that emotion is a part of embodiment and of word imageability (Abbassi et al., 2015) and reinforces the notion that lexico-semantic processing diverges in middle age (Wulff et al., 2022). Higher inter-rater disagreement in imageability ratings was also associated with words of greater length and AoA. This finding is consistent with previous studies on the lesser imageability of morphologically complex words, which tend to be acquired later in life and are associated with concepts less grounded in sensorimotor representations (e.g., Reilly & Kean, 2007). Besides average imageability, as expected, the most important predictor of disagreement in imageability ratings was how frequent words were for participants. While subjective frequency predicted lower inter-rater disagreement, higher objective frequency was associated with higher levels of disagreement across all age groups. However, for the younger participants, disagreement levels in imageability ratings were higher for words on which subjective frequency was consensual across all participants, a result which probably stems from the high correlation between inter-rater disagreement and mean of subjective frequency ratings. Indeed, younger participants might find it difficult to differentiate between gradations of occurrence at the higher end of the frequency scale, while agreeing more easily on rarer words, which would be more salient within their smaller lexico-semantic network (Wulff et al., 2022). The overall stability of the lexical predictors of inter-rater disagreement between age groups, contrasting with the significant differences in rating behaviours, supports the idea that inter-rater disagreement heavily relies on inter-individual differences which can evolve with age.
Furthermore, these age-related differences in rating behaviours are in line with recent work providing evidence of changes in lexico-semantic properties over the lifespan (Ballot et al., 2022; Krethlow et al., 2020; Wulff et al., 2016, 2022). Lifelong exposure to linguistic and sensorimotor experiences may contribute to the increasing inter-individual diversity in the structure of one’s semantic network (Wulff et al., 2022). Thus, changes in rating behaviour across the lifespan might be related to increased inter- and intra-individual heterogeneity of the mental lexicon, but also to modifications in other cognitive processes such as interoception, which is typically reduced in older adults (MacCormack et al., 2021) and could result in knowledge-based compensation (Shafto et al., 2012).
Overall, these results add to the growing list of studies questioning the traditional psycholinguistic premise that linguistic processes are uniform across expert readers. They underline the importance of investigating age-related changes, as well as other possible sources of inter-individual differences in lexico-semantic processing.
Imageability effects, visual word recognition and disagreement in imageability ratings
The high number of words found in Study 1 for which individual ratings varied considerably around the average imageability value could suggest that the facilitatory effect of imageability on word processing reported in the literature is driven by inter-rater disagreement in imageability ratings. In Study 2, however, mean word imageability facilitated visual word recognition performance, even when inter-rater disagreement in imageability ratings was controlled for. This result provides reassuring evidence that imageability effects are unlikely to be artefacts of inter-rater disagreement in imageability ratings and reasserts the overall validity of imageability norms.
As opposed to what we expected given the number of non-consensual words identified in Study 1 and the theoretical implications regarding the validity of averaged ratings, disagreement in imageability ratings was not found to influence visual word recognition performance, nor to modify the typical imageability effect at group level. Furthermore, when focusing on words on which participants agreed the least when performing imageability ratings, group-level performance in visual word recognition was still predicted by average imageability rating, but mainly depended on word frequency, age of acquisition, and participants’ familiarity with the words (Study 2).
Although disagreement-inducing words were prevalent and followed a striking distribution, non-consensual imageability ratings did not translate into modifications of word processing. These results are in contrast with findings from Pollock (2018) on the variability of concreteness norms and their influence on memory performance. It appears that, contrary to concreteness, which is perceived by participants as a discrete, dichotomous phenomenon (Pollock, 2018), imageability is treated as a more continuous phenomenon, but one for which participants are unreliable judges outside of the extremities of the scale. We must consider the possibility, however, that megastudies, through the number of stimuli and participants they aggregate, circumvent any interference from inter-rater disagreement on imageability effects. Even if imageability norms are robust as a whole, we believe that researchers working with smaller datasets, such as the ones used in factorial designs, should pay attention to inter-rater disagreement around imageability (or other subjective ratings), as elevated values do question what the average rating actually measures. Given the pervasiveness and distribution of high inter-rater disagreement within imageability norms, it seems difficult to definitively rule out the possibility of imageability effects reflecting, at least partially, the effects of rating disagreement. While for highly imageable and consensual words, sensorimotor reactivation consistently aids word processing, for non-consensual words, individual differences or inhibition stemming from semantic competition between meanings (Rodd, 2020; Huete-Pérez et al., 2020) could lead to variable facilitation, reducing or negating the average effect on performance.
As such, the cause for the particularity of non-consensual words, especially middle-range ones, still needs to be thoroughly investigated in the future in order to clarify the implications for the construct validity of these ratings.
Altogether, these results provide further evidence of the robustness of the imageability effect and confirm the usefulness of imageability ratings in psycholinguistic investigations. These interpretations tie in with arguments from previous work suggesting that 1) imageability effects, and more generally reputedly embodied effects, do not require the mobilization of conscious imagery (e.g., Dymarska et al., 2023; Ibanez et al., 2022); 2) people have difficulties in translating unconscious simulations into conscious imagery (Connell & Lynott, 2016); and 3) psycholinguistic norms do indeed benefit from the “wisdom of the crowd” (Winter, 2022, p. 492), as collective judgments formed from numerous observations give rise to a more reliable measure (Paisios et al., 2023). Ultimately, these points highlight the importance of simultaneously considering individual differences and their interplay with psycholinguistic variables in order to improve our understanding of how sensory experiences can influence language processing.
Conclusion
Reassuringly, the present paper supports the robustness of imageability effects in visual word recognition, as they persist even when considering the numerous words exhibiting high disagreement levels in imageability ratings. It is therefore unlikely that the classic imageability effects reported in the literature are artefacts of disagreement levels, although further investigations are needed to entirely rule out this possibility. Still, our results confirm the overall validity of imageability measures when studying word meaning. Although inter-rater disagreement does not impact word processing performance in large datasets, words with middle-range imageability ratings should still be considered cautiously, as their average imageability rating reveals little of their semantic properties other than the fact that something yet to be identified makes them disagreement-prone for participants. When investigating or controlling for imageability, the exclusion of middle-range words with exceedingly high SDs would thus allow for more confidence in experimental results and their interpretations, particularly in smaller datasets, which are more susceptible to measurement noise. Work on variability within imageability (and other semantic) ratings should be extended to larger datasets and other languages to determine the extent of inter-individual variability in the perception and appraisal of meaning-related word characteristics. Moreover, the differing proportion of non-consensual words in imageability ratings and the difference in lexical predictors of this variability between age groups underline the importance of age-inclusive ratings and of investigating semantic processing in populations other than the predominantly studied young adults. Ultimately, it appears that people have difficulties in translating unconscious simulations into conscious imagery (Connell & Lynott, 2016), which could relate to individual differences in interoceptive abilities.
Further work remains necessary to pinpoint inter-participant sources of variability in semantic ratings and to specify the reasons for which someone might judge a word as being easily imageable whilst their neighbour cannot picture it.
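As an illustration of the item-selection recommendation made in this section, the following minimal Python sketch flags middle-range words with high rating SDs so that they can be set aside when building a stimulus list. It is not the analysis code used in this paper. The class and function names, the middle-range bounds (3.0 to 5.0, assuming a 7-point scale), the SD cutoff (2.0), and the example values are all hypothetical and would need to be adapted to the norms at hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordNorm:
    word: str
    mean_imageability: float  # average rating across raters
    sd_imageability: float    # standard deviation across raters


def is_low_consensus(norm, mid_low=3.0, mid_high=5.0, sd_cutoff=2.0):
    """Return True for a middle-range word whose rating SD exceeds the cutoff.

    All three thresholds are illustrative assumptions, not published values.
    """
    in_middle = mid_low <= norm.mean_imageability <= mid_high
    return in_middle and norm.sd_imageability > sd_cutoff


def split_by_consensus(norms, **thresholds):
    """Split norms into (kept, excluded) lists using is_low_consensus."""
    kept = [n for n in norms if not is_low_consensus(n, **thresholds)]
    excluded = [n for n in norms if is_low_consensus(n, **thresholds)]
    return kept, excluded


# Made-up values for illustration only; not taken from any published norms.
example_norms = [
    WordNorm("strawberry", 6.8, 0.5),
    WordNorm("entity", 1.9, 1.1),
    WordNorm("hypothetical_word", 4.1, 2.3),
]

kept, excluded = split_by_consensus(example_norms)
print([n.word for n in kept])
print([n.word for n in excluded])
```

Keeping the thresholds as keyword arguments makes it straightforward to check how sensitive a given item selection is to the chosen cutoffs, for example by calling `split_by_consensus(example_norms, sd_cutoff=2.5)`.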
Authors’ contributions
The first author conceptualized the study, performed data manipulation and statistical analyses, and drafted the initial manuscript. The second author supervised the research, contributed to the study's conceptualization and the writing of the manuscript. Both authors contributed equally to the revisions and editing of the manuscript and approved the final manuscript.
Funding
This work was funded by a doctoral research grant awarded to the first author by the French Ministry for Higher Education and Research (MESR grant n°2020-NM-88).
Availability of data and materials
All data and materials are available at osf.io.
Code availability
R code for all main analyses is available at osf.io.
Declarations
Conflicts of interest
Not applicable.
Ethics approval
No approval of research ethics committees was required to accomplish the goals of this research as it re-used previously published behavioural data. Norms collection for AoA fell under an IRB approval from the University of Bordeaux granted to the authors for ongoing work on lexical ratings (2023.03.CLE009).
Consent to participate
Informed consent was obtained from all individual participants included in the data collection for the AoA norms.
Consent for publication
Not applicable.
1. Participants were asked to answer "0 years" as the age of learning for any word they did not know. These observations were excluded from the calculation of the ratings.
2. We do not consider such reasoning relevant for the emotional characteristics of the words, which display weak correlations with imageability ratings, and therefore chose not to include the SDs of emotional variables in the model.
3. To reduce the risk of overfitting due to spurious correlations, we performed tenfold cross-validation for all step 4 models using the caret package (Kuhn, 2008). These analyses confirmed the generalizability of our predictors.
4. We thank Marc Brysbaert for recommending this analysis.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open practices statement
Materials, including the newly collected AoA norms, and R code used for analyses of Study 1 and Study 2 are available at https://osf.io/yunr2/?view_only=aa96e96b95874c678f2b774f727082be
References
Abbassi, E., Blanchette, I., Ansaldo, A. I., Ghassemzadeh, H., & Joanette, Y. (2015). Emotional words can be embodied or disembodied: The role of superficial vs. deep types of processing. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00975
Acar, E. A., Zeyrek, D., Kurfali, M., & Bozşahin, C. (2016). A Turkish database for psycholinguistic studies based on frequency, age of acquisition, and imageability. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (pp. 3600–3606). European Language Resources Association (ELRA).
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17.
Andrews, S. (2015). Individual differences among skilled readers: The role of lexical quality. In A. Pollatsek & R. Treiman (Eds.), The Oxford handbook of reading (pp. 129–148). Oxford University Press.
Ballot, C., Mathey, S., & Robert, C. (2021). Word imageability and orthographic neighbourhood effects on memory: A study in free recall and recognition. Memory, 29.
Ballot, C., Mathey, S., & Robert, C. (2022). Age-related evaluations of imageability and subjective frequency for 1286 neutral and emotional French words: Ratings by young, middle-aged, and older adults. Behavior Research Methods, 54, 196–215.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133.
Barsalou, L. W. (2008). Grounded Cognition. Annual Review of Psychology, 59.
Barsalou, L. W. (2020). Challenges and opportunities for grounding cognition. Journal of Cognition, 3(1). https://doi.org/10.5334/joc.116
Bird, H., Franklin, S., & Howard, D. (2001). Age of acquisition and imageability ratings for a large set of words, including verbs and function words. Behavior Research Methods, Instruments, & Computers, 33.
Bonin, P., Méot, A., & Bugaiska, A. (2018). Concreteness norms for 1,659 French words: Relationships with other psycholinguistic variables and word recognition times. Behavior Research Methods, 50.
Brysbaert, M., Lange, M., & Van Wijnendaele, I. (2000). The effects of age-of-acquisition and frequency-of-occurrence in visual word recognition: Further evidence from the Dutch language. European Journal of Cognitive Psychology, 12.
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46.
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27.
Citron, F. M. M., Weekes, B. S., & Ferstl, E. C. (2014). Arousal and emotional valence interact in written word recognition. Language, Cognition and Neuroscience, 29.
Connell, L., & Lynott, D. (2016). Do we know what we’re simulating? Information loss on transferring unconscious perceptual simulation to conscious imagery. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42.
Connell, L., & Lynott, D. (2018). Embodied Semantic Effects in Visual Word Recognition. In Y. Coello & M. Fischer (Eds.), Foundations of Embodied Cognition: Conceptual and Interactive Embodiment (Vol. 2, pp. 71–89). Psychology Press. https://dx.doi.org/10.31234/osf.io/cgs85
Connell, L., Lynott, D., & Banks, B. (2018). Interoception: The forgotten modality in perceptual grounding of abstract and concrete concepts. Philosophical Transactions of the Royal Society B: Biological Sciences, 373.
Cortese, M. J., & Fugett, A. (2004). Imageability ratings for 3,000 monosyllabic words. Behavior Research Methods, Instruments, & Computers, 36.
Cortese, M. J., & Schock, J. (2013). Imageability and age of acquisition effects in disyllabic word recognition. The Quarterly Journal of Experimental Psychology, 66, 946–972. https://dx.doi.org/10.1080/17470218.2012.722660 (PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23030642)
Desrochers, A., & Thompson, G. L. (2009). Subjective frequency and imageability ratings for 3,600 French nouns. Behavior Research Methods, 41.
Dujardin, E., Jobard, G., Vahine, T., & Mathey, S. (2022). Norms of vocabulary, reading, and spelling tests in French university students. Behavior Research Methods, 54, 1611–1625. https://dx.doi.org/10.3758/s13428-021-01684-5 (PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34505996)
Dymarska, A., Connell, L., & Banks, B. (2023). Weaker than you might imagine: Determining imageability effects on word recognition. Journal of Memory and Language, 129, 104398. https://dx.doi.org/10.1016/j.jml.2022.104398
Evans, G. A. L., Lambon Ralph, M. A., & Woollams, A. M. (2012). What’s in a word? A parametric study of semantic influences on visual word recognition. Psychonomic Bulletin & Review, 19.
Ferrand, L., Bonin, P., Méot, A., Augustinova, M., New, B., Pallier, C., & Brysbaert, M. (2008). Age-of-acquisition and subjective frequency estimates for all generally known monosyllabic French words and their relation with other psycholinguistic variables. Behavior Research Methods, 40.
Ferrand, L., Brysbaert, M., Keuleers, E., New, B., Bonin, P., Méot, A., Augustinova, M., & Pallier, C. (2011). Comparing Word Processing Times in Naming, Lexical Decision, and Progressive Demasking: Evidence from Chronolex. Frontiers in Psychology, 2. https://doi.org/10.3389/fpsyg.2011.00306
Ferrand, L., Méot, A., Spinelli, E., New, B., Pallier, C., Bonin, P., Dufau, S., Mathôt, S., & Grainger, J. (2018). MEGALEX: A megastudy of visual and auditory word recognition. Behavior Research Methods, 50, 1285–1307.
Gobin, P., Camblats, A.-M., Faurous, W., & Mathey, S. (2017). Une base de l’émotionalité (valence, arousal, catégories) de 1286 mots français selon l’âge (EMA). Revue Européenne de Psychologie Appliquée, 67.
Grandy, T. H., Lindenberger, U., & Schmiedek, F. (2020). Vampires and nurses are rated differently by younger and older adults—Age-comparative norms of imageability and emotionality for about 2500 German nouns. Behavior Research Methods, 52.
Hoffman, P. (2018). Divergent effects of healthy ageing on semantic knowledge and control: Evidence from novel comparisons with semantically impaired patients. Journal of Neuropsychology, 0(0), 0. https://doi.org/10.1111/jnp.12159
Hoffman, P., Lambon Ralph, M. A., & Rogers, T. T. (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45.
Hommel, B., & Kibele, A. (2016). Down with Retirement: Implications of Embodied Cognition for Healthy Aging. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.01184
Huete-Pérez, D., Haro, J., Fraga, I., & Ferré, P. (2020). HEROÍNA: Drug or hero? Meaning-dependent valence norms for ambiguous Spanish words. Applied Psycholinguistics, 41.
Ibanez, A., Kühne, K., Miklashevsky, A., Monaco, E., Muraki, E., Ranzini, M., Speed, L., & Tuena, C. (2022). Ecological meanings: A consensus paper on individual differences and contextual influences in embodied language. https://doi.org/10.31219/osf.io/ej5y3
Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology, 68.
Khanna, M. M., & Cortese, M. J. (2021). How well imageability, concreteness, perceptual strength, and action strength predict recognition memory, lexical decision, and reading aloud performance. Memory (Hove, England), 1–15. https://doi.org/10.1080/09658211.2021.1924789
Kidd, E., Donnelly, S., & Christiansen, M. H. (2018). Individual differences in language acquisition and processing. Trends in Cognitive Sciences, 22.
Krethlow, G., Fargier, R., & Laganaro, M. (2020). Age-Specific Effects of Lexical-Semantic Networks on Word Production. Cognitive Science, 44.
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). https://doi.org/10.18637/jss.v028.i05
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44.
Kuperman, V., Estes, Z., Brysbaert, M., & Warriner, A. B. (2014). Emotion and language: Valence and arousal affect word recognition. Journal of Experimental Psychology: General, 143.
Lau, M. C., Goh, W. D., & Yap, M. J. (2018). An item-level analysis of lexical-semantic effects in free recall and recognition memory using the megastudy approach. Quarterly Journal of Experimental Psychology, 71.
MacCormack, J. K., Henry, T. R., Davis, B. M., Oosterwijk, S., & Lindquist, K. A. (2021). Aging bodies, aging emotions: Interoceptive differences in emotion representations and self-reports across adulthood. Emotion, 21.
Meteyard, L., & Vigliocco, G. (2018). Lexico-Semantics. In S.-A. Rueschemeyer & M. G. Gaskell (Eds.), The Oxford Handbook of Psycholinguistics. Oxford University Press. https://dx.doi.org/10.1093/oxfordhb/9780198786825.013.4
New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, Instruments, & Computers, 36.
Paisios, D., Huet, N., & Labeye, E. (2023). Addressing the elephant in the middle: Implications of the midscale disagreement problem through the lens of body-object interaction ratings. Collabra: Psychology, 9.
Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76.
Ploetz, D. M., & Yates, M. (2016). Age of acquisition and imageability: A cross-task comparison. Journal of Research in Reading, 39.
Pollock, L. (2018). Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study. Behavior Research Methods, 50.
Posit Team. (2023). R Studio (2023.3.1.446) [Computer software]. https://www.posit.co/
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Reilly, J., & Kean, J. (2007). Formal Distinctiveness of High- and Low-Imageability Nouns: Analyses and Theoretical Implications. Cognitive Science, 31.
Rodd, J. M. (2020). Settling into semantic space: An ambiguity-focused account of word-meaning access. Perspectives on Psychological Science, 15(2), 411–427. https://doi.org/10.1177/1745691619885860
Rofes, A., Zakariás, L., Ceder, K., Lind, M., Johansson, M. B., de Aguiar, V., Bjekić, J., Fyndanis, V., Gavarró, A., Simonsen, H. G., Sacristán, C. H., Kambanaros, M., Kraljević, J. K., Martínez-Ferreiro, S., Mavis, İ., Orellana, C. M., Sör, I., Lukács, Á., Tunçer, M., …, & Howard, D. (2018). Imageability ratings across languages. Behavior Research Methods, 50(3), 1187–1197. https://doi.org/10.3758/s13428-017-0936-0
Rojas, C., Riffo, B., & Guerra, E. (2022). Visual word recognition among oldest old people: The effect of age and cognitive load. Frontiers in Aging Neuroscience, 14, 1007048. https://dx.doi.org/10.3389/fnagi.2022.1007048 (PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36247989; PubMed Central: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9561928)
Schock, J., Cortese, M. J., & Khanna, M. M. (2012). Imageability estimates for 3,000 disyllabic words. Behavior Research Methods, 44.
Shafto, M., Randall, B., Stamatakis, E. A., Wright, P., & Tyler, L. K. (2012). Age-related neural reorganization during spoken word recognition: The interaction of form and meaning. Journal of Cognitive Neuroscience, 24.
Siakaluk, P. D., Newcombe, P. I., Duffels, B., Li, E., Sidhu, D. M., Yap, M. J., & Pexman, P. M. (2016). Effects of emotional experience in lexical decision. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.01157
Simonsen, H. G., Lind, M., Hansen, P., Holm, E., & Mevik, B.-H. (2013). Imageability of Norwegian nouns, verbs and adjectives in a cross-linguistic perspective. Clinical Linguistics & Phonetics, 27.
Stoet, G. (2010). PsyToolkit: A software package for programming psychological experiments using Linux. Behavior Research Methods, 42.
Stoet, G. (2017). PsyToolkit: A novel web-based method for running online questionnaires and reaction-time experiments. Teaching of Psychology, 44.
Su, Y., Li, Y., & Li, H. (2022). Imageability ratings for 10,426 Chinese two-character words and their contribution to lexical processing. Current Psychology. https://doi.org/10.1007/s12144-022-03404-4
Vergallito, A., Petilli, M. A., & Marelli, M. (2020). Perceptual modality norms for 1,121 Italian words: A comparison with concreteness and imageability scores and an analysis of their impact in word processing tasks. Behavior Research Methods, 52.
Westbury, C. (2014). You can’t drink a word: Lexical and individual emotionality affect subjective familiarity judgments. Journal of Psycholinguistic Research, 43.
Westbury, C. F., Shaoul, C., Hollis, G., Smithson, L., Briesemeister, B. B., Hofmann, M. J., & Jacobs, A. M. (2013). Now you see it, now you don’t: On emotion, context, and the algorithmic prediction of human imageability judgments. Frontiers in Psychology, 4(DEC). https://doi.org/10.3389/fpsyg.2013.00991
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., …, & Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Winter, B. (2022). Managing semantic norms for cognitive linguistics, corpus linguistics, and lexicon studies. In A. L. Berez-Kroeker, B. McDonnell, E. Koller, & L. B. Collister (Eds.), The Open Handbook of Linguistic Data Management. The MIT Press. https://dx.doi.org/10.7551/mitpress/12200.001.0001
Wulff, D. U., Hills, T. T., Lachman, M., & Mata, R. (2016). The Aging Lexicon: Differences in the Semantic Networks of Younger and Older Adults. https://dwulff.github.io/Papers/WulffEtAl2016AgingLexicon_final.pdf
Wulff, D. U., Hills, T. T., & Mata, R. (2022). Structural differences in the semantic networks of younger and older adults. Scientific Reports, 12.
Yap, M. J., Balota, D. A., Sibley, D. E., & Ratcliff, R. (2012a). Individual differences in visual word recognition: Insights from the English Lexicon Project. Journal of Experimental Psychology: Human Perception and Performance, 38.
Yap, M. J., Pexman, P. M., Wellsby, M., Hargreaves, I. S., & Huff, M. J. (2012b). An abundance of riches: Cross-Task comparisons of semantic richness effects in visual word recognition. Frontiers in Human Neuroscience, 6(APRIL 2012). https://doi.org/10.3389/fnhum.2012.00072
© The Psychonomic Society, Inc. 2024.