Abstract
This study investigates the potential of large language models (LLMs) to estimate the familiarity of words and multi-word expressions (MWEs). We validated LLM estimates for isolated words against existing human familiarity ratings and found strong correlations. LLM familiarity estimates even outperformed the best available word frequency measures in predicting lexical decision and naming performance in megastudies. We then applied LLM estimates to MWEs and found that they measure familiarity effectively for these expressions as well. We have created a list of more than 400,000 English words and MWEs with LLM-generated familiarity estimates, which we hope will be a valuable resource for researchers. There is also a cleaned-up list of nearly 150,000 entries, excluding lesser-known stimuli, to streamline stimulus selection. Our findings highlight the advantages of LLM-based familiarity estimates: they outperform traditional word frequency measures (particularly for predicting word recognition accuracy), generalize to MWEs, are available for large lists of words, and make it easy to obtain new estimates for all types of stimuli.
Word frequency
Frequently occurring words are easier to process than words that rarely occur in the language. This word frequency effect was first demonstrated in a largely overlooked article by Preston (1935). She examined the rate of word perception by measuring the time between seeing a stimulus word and naming it. The stimuli consisted of 50 familiar and 50 unfamiliar six-letter two-syllable words. The selection was based on Thorndike’s (1931) list of word frequencies of 20,000 words, which had been published a few years earlier. The familiar words were chosen from the 1500 most frequent words in Thorndike’s list; the unfamiliar words were taken from the bottom 2000 words. The average perception time for the familiar words in Preston (1935) was 578 ms; for the unfamiliar words it was 691 ms, a difference of 113 ms. Unfortunately, no examples of the familiar and unfamiliar words were given.
The word frequency effect is a cornerstone of psycholinguistic word research (Brysbaert et al., 2011a, 2018). The effect is an interesting research topic in its own right, and it must also be taken into account when other topics are studied, in which case word frequencies are expected to be matched across the conditions of interest. Gradually, however, authors began to point out that not all word frequency estimates are of the same quality (Balota et al., 2007; Brysbaert & New, 2009; Burgess & Livesay, 1998). Two variables are important: (1) the size of the corpus on which the frequency norms are based, and (2) the language register of the corpus.
Regarding corpus size, Brysbaert and New (2009) recommended a minimum of 16 million words; otherwise, the frequencies of rare words are poorly estimated. Brysbaert and New also pointed out the importance of language register, because word frequencies based on movie subtitles correlated more strongly with word processing efficiency than the existing word frequency measures based on books and articles (which consisted largely of nonfiction text). The word processing efficiency data came from the English Lexicon Project (Balota et al., 2007). They comprised word naming accuracy and latency and lexical decision accuracy and latency (does the string of letters shown form an existing word or not?).
The importance of language register was further confirmed when Brysbaert et al. (2011b) compared the subtitle word frequencies to frequency estimates based on a Google book corpus of 131 billion words. Despite the huge size difference, the subtitle frequencies predicted word processing efficiency better than the Google frequencies, arguably because subtitles are more representative of the type of language undergraduate students (participants of the English Lexicon Project) have been exposed to.
Other authors argued that the correlation between word frequency and word processing efficiency can be increased by taking into account the diversity of contexts in which the words occur. A raw count of the number of times a word occurs in a corpus does not distinguish between a word that is repeatedly used in one particular source (text, film) and a word that occurs occasionally in many different sources. Several measures have been proposed to correct for the degree of burstiness or dispersion in word occurrence (Adelman et al., 2006; Carroll, 1970; Gries, 2008). One of the most recent corrections was proposed by Johns and colleagues (Chang et al., 2023; Johns & Jones, 2022). They argued that a semantic distinctiveness measure (UCD-SD) explained significantly more variance than raw word frequency counts, as we will see below. The semantic distinctiveness measure corrects word frequency for the semantic similarity of the contexts in which the words occur.
Word familiarity
In parallel with attempts to count words in corpora and correct them for similarities in context, some researchers argued that a better approach was to ask participants directly how familiar they are with words. Haagen (1949), for example, asked students to rate their familiarity with 480 words on a five-point Likert scale. Familiarity was defined in terms of how familiar the meaning and usage of the words were to the assessor. The criterion for “greatest familiarity” was immediate recognition and absolute certainty of the meaning and use of the word. The lowest degree of familiarity was associated with words that were not recognized. For instance, the word “foolish” was rated close to 5, while the word “insane” was rated close to 3. Cortese and Khanna (2007) reported that both word frequency and word familiarity predict variance in word processing times (naming and lexical decision). An even stronger conclusion was drawn by Gernsbacher (1984), who stated that word familiarity is a better measure of word processing efficiency than word frequency, especially for less familiar words (see also Chen & Dong, 2019).
Several attempts have been made to compile word familiarity datasets (see list below). However, these have never been as influential as word frequency databases. The main reason was that it takes more effort to collect familiarity ratings than word frequency counts. As a result, familiarity norms were at best limited to a few thousand words, while frequency measurements were available for tens of thousands of words. A second reason was that Brysbaert and Cortese (2011) noted that the advantage of word familiarity ratings over word frequency counts diminished as word frequency measures improved. They concluded that the correlation between word familiarity and word processing efficiency could be explained by the joint effects of word frequency and the age at which a word is learned (age of acquisition).
Some authors also distrusted assessments of word familiarity because of their subjective nature (e.g., Baayen et al., 2016). A group of participants (mostly students from one university) were asked to provide ratings and could use whatever information they thought was useful. At worst, the correlation between word familiarity and word processing efficiency was not due to a causal relationship from familiarity to efficiency, but the other way around, with students assuming that easy-to-process words “must” be familiar, while hard-to-process words “must” be unfamiliar. If such reasoning is involved, it is difficult to argue that word familiarity ratings say much about the factors underlying word recognition. Westbury (2014) further argued that familiarity ratings are unfairly influenced by students’ interests, which he summarized as clothing, food and drink, money, relationships (including sex), and time (see also Balota et al., 2001).
Because of these doubts about the interpretation of familiarity ratings, familiarity has not been included in recent attempts to collect word information for large numbers of words online. Although collecting such norms has become feasible, little return on investment was expected.
We became interested in familiarity assessments again when we tried to obtain estimates of processing efficiency for multi-word expressions such as “doom loop,” “on the go,” or “betwixt and between.” Collecting frequency counts for these stimuli is much more challenging than frequency counts for individual words because expressions can take many forms (e.g., the different forms of “washing oneself”) and words can be added, omitted, or changed (e.g., in the expression “everything comes to him who waits”). Collecting familiarity ratings seemed safer than trying to get valid frequency counts.
Large language models (LLMs) are neural networks that learn to predict the next word based on the preceding sequence of words. They can be tuned to become conversational models that can be queried in a human-like manner. The best-known example is ChatGPT, which made the technology available to a wide audience. The fact that conversational large language models can be queried for features of words opens the possibility of massive data collection and addresses the concern that biases in a small group of participants can affect ratings. Before we investigate how good LLM-based familiarity estimates are, we discuss the existing evidence that LLMs can provide useful information about words.
Existing evidence for the usefulness of LLM-based estimates of word characteristics
Trott (2024a, b) has done the most extensive research on LLM-generated data as an alternative to human ratings. Trott (2024a) collected GPT-4 estimates (https://openai.com/; see also Wu et al., 2023) for 26 sets of human ratings ranging from word iconicity to gender-relatedness of words. He observed positive correlations for each measure, ranging from .39 (for semantic dominance) to .86 (for semantic similarity between two words). The correlations between the GPT-4 estimates and the average human ratings were almost always greater than the correlations between individual human raters and the group data, leading Trott (2024a) to conclude that the GPT-4 estimates can be considered an average of a group of people (wisdom of the crowd).
The line of research was continued in Trott (2024b), who estimated the number of people needed to beat GPT-4 for four word features: judgments about concreteness of ambiguous English words in context (toast in the context of bread or speech), judgments about the valence of ambiguous English words in context (arm as limb or weapon), judgments about the relatedness of ambiguous English words in context (cast as plaster cast or actors), and judgments about the iconicity of English words (wiggle vs. partial; Winter et al., 2024). The numbers of participants needed to beat GPT-4 were: 3 for concreteness, 2 for semantic relatedness, 11 for valence, and 8 for iconicity.
At the same time, Trott (2024a) warned not to deify AI-generated word features. Large language models such as ChatGPT are easy to use but not transparent. In particular, Trott cautioned against leakage and contamination. These terms refer to an overlap between the training set and the test set. Since GPT-4 was trained on billions of texts/databases, it cannot be ruled out that the original human datasets were part of the training material. If so, the good performance of GPT-4 may be limited to the words present in the datasets and may not generalize (well) to new words for which human data are not yet available. Although Trott (2024a, b) found no evidence of leakage and contamination in the datasets he analyzed, he argued that it is a concern that must not be forgotten.
Similar cautionary messages were expressed by other authors. Dillion et al. (2023) warned that language models will never completely replace human participants, even when strongly correlated with human judgments, but that the estimates may be useful in pilot testing, idea generation, and fine-tuning. Atari et al. (2023) argued that training materials are biased to the Western world. Grieve et al. (2024) cautioned that each language model inherently models the language variety represented by the corpus on which it has been trained. Messeri and Crockett (2024) noted that AI solutions can exploit our cognitive limitations, leaving us vulnerable to illusions of understanding in which we think we understand more of the world than we actually do. Such illusions can even create scientific monocultures in which certain types of methods, questions and viewpoints come to dominate alternative approaches, making science less innovative and more vulnerable to error.
How well do LLM familiarity estimates approach human familiarity ratings?
Compared to the existing use of LLM models as a substitute for human judgment, our ambitions were rather limited. We wanted to know whether an LLM model would be able to provide a good alternative to human familiarity ratings, similar to the question addressed by Trott (2024a, b). We were particularly interested in its potential to provide familiarity estimates for expressions consisting of multiple words, since it is not clear how to compute good frequency norms for such expressions. Our hope was that LLM-generated familiarity estimates would provide a good enough approximation to human ratings, so that the values can be used for stimulus selection and stimulus control. This would make it possible to reduce the number of stimuli presented to people for assessment.
There were reasons to be optimistic about the performance of LLMs with respect to familiarity estimates, given that the relevant information is fully present in the training materials (as opposed to judgments that refer to the world outside written texts, such as moral judgments or embodied cognition), at least if the model can compensate for the fact that its training corpus is vastly larger than the language exposure any human could ever dream of achieving.
The first step in using LLM-generated familiarity estimates is to validate them. The most direct way to do this is to correlate them with human ratings, as was done by Trott (2024a; see also Martínez et al., 2025). As far as we know, there are no human familiarity ratings for multi-word expressions yet, but there are five large datasets for individual words. In chronological order, they are:
MRC familiarity ratings (Coltheart, 1981): This is a collection of three previous rating studies (Paivio et al., 1968; Gilhooly & Logie, 1980; and an unpublished dataset from Paivio, later published as Clark & Paivio, 2004). The dataset contains Likert ratings on a seven-point scale for 4920 English words.
Hoosier norms (Nusbaum et al., 1984): This report contains familiarity ratings from American students on a seven-point Likert scale for 19,320 English words. It was long an unpublished report, but the data have recently become available in a number of databases (e.g., Vaden et al., 2009; Gao et al., 2023).
Bristol norms (Stadthagen-Gonzalez & Davis, 2006): Contains ratings on a seven-point Likert scale for 1450 words not covered by Gilhooly and Logie (1980).
Compound words (Juhasz et al., 2015): Contains ratings on a seven-point Likert scale for 629 English bi-lexeme compound nouns.
Glasgow norms (Scott et al., 2019): Contains ratings on a seven-point scale for 5553 English words.
Before looking at correlations with LLM estimates, it is useful to examine how the ratings correlate across datasets. This gives an idea of the correlations that can be expected. Table 1 shows the correlations. Because the surveys rated different numbers of words and the datasets overlap to different degrees, the correlations are based on different numbers of shared words (also included in Table 1).
Table 1 . Pearson correlations between the human ratings of familiarity (N = the number of observations on which the correlation is based)
|  | MRC (N = 4920) | Hoosier (N = 19,320) | Bristol (N = 1450) | Compound (N = 629) | Glasgow (N = 5553) |
|---|---|---|---|---|---|
| MRC |  | .60 (N = 3894) | .70 (N = 625) | .64 (N = 47) | .80 (N = 2540) |
| Hoosier |  |  | .46 (N = 1236) | .67 (N = 242) | .47 (N = 3598) |
| Bristol |  |  |  | .35 (N = 12) | .83 (N = 1397) |
| Compound |  |  |  |  | .75 (N = 61) |
A look at Table 1 shows that the Hoosier norms differ from the Bristol and Glasgow norms, and also to some extent from the MRC norms (the Compound norms are too small in number of overlapping words to draw strong conclusions). This may indicate regional differences in the ratings (the MRC assessments were collected primarily in Canada). Another reason is that the Hoosier ratings used different instructions. Whereas the other databases used a form of familiarity assessment (see below for the Glasgow instructions), the Hoosier ratings asked more about word knowledge. The instructions were described as follows: “A seven‐point rating scale was used in which a rating of one indicated that the word was unknown, a rating of seven indicated that the word was familiar and its meaning was well known. A rating of four indicated that the stimulus was definitely recognized as a word, but its meaning was unknown. The ratings two, three, five, and six were used to indicate variations between these response categories. For example, a rating of three indicated that the subject might have seen the word before, while a rating of five indicated that the subject recognized the word, but had only the vaguest notion of its meaning.”
To obtain LLM familiarity estimates, we used the following prompt inspired by the instructions of Scott et al. (2019) for the Glasgow norms.1 Estimates are based on ChatGPT4 (version GPT-4o-2024-08-06), as this LLM gave the best estimates in a previous study we ran (Martínez et al., 2025):
“Familiarity is a measure of how familiar something is. A word is very FAMILIAR if you see/hear it often and it is easily recognizable. In contrast, a word is very UNFAMILIAR if you rarely see/hear it and it is relatively unrecognizable. Please indicate how familiar you think each word is on a scale from 1 (VERY UNFAMILIAR) to 7 (VERY FAMILIAR), with the midpoint representing moderate familiarity. The word is: [insert word here]. Only answer a number from 1 to 7. Please limit your answer to numbers.”
The instruction was repeated for each word to avoid dilution across trials. We used Python programs developed by Martínez et al. (2023) to automate the queries via an Application Programming Interface (API). GPT can be asked to return the probabilities of the five most likely answers with the logprobs option (Hills & Anadkat, 2023). This was particularly useful for our measure (a 1–7 rating scale), which consists of a small set of numerical answers. From the logprobs, we derived two outputs: (1) the estimate with the highest probability, and (2) a mean value based on the probabilities of the various alternatives. The former equals the output of the model with temperature = 0 (always give the same, most likely answer). The latter makes it possible to calculate a more precise estimate by summing the answer alternatives weighted by their probabilities. For instance, if the probabilities for a word were .237 for answer 2, .561 for answer 3, and .202 for answer 4, the end estimate would be .237*2 + .561*3 + .202*4 = 2.965. We used the more precise estimate in all analyses below.
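To make the procedure concrete, the following minimal Python sketch shows how such probability-weighted estimates can be obtained from the OpenAI chat completions API. It is not the pipeline of Martínez et al. (2023); the model name follows the version mentioned above, and the renormalization over the numeric tokens is our own assumption (in the worked example in the text, the reported probabilities already sum to 1).

```python
import math
from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Familiarity is a measure of how familiar something is. A word is very "
    "FAMILIAR if you see/hear it often and it is easily recognizable. In contrast, "
    "a word is very UNFAMILIAR if you rarely see/hear it and it is relatively "
    "unrecognizable. Please indicate how familiar you think each word is on a "
    "scale from 1 (VERY UNFAMILIAR) to 7 (VERY FAMILIAR), with the midpoint "
    "representing moderate familiarity. The word is: {word}. Only answer a number "
    "from 1 to 7. Please limit your answer to numbers."
)

def familiarity_estimate(word: str, model: str = "gpt-4o-2024-08-06") -> float:
    """Probability-weighted familiarity estimate for a single word or expression."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(word=word)}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,  # the five most likely first tokens and their log probabilities
    )
    alternatives = response.choices[0].logprobs.content[0].top_logprobs
    # Sum of rating * probability over the numeric alternatives, renormalized so that
    # the probabilities of the valid ratings add up to 1 (our assumption).
    weighted = total = 0.0
    for alt in alternatives:
        token = alt.token.strip()
        if token in {"1", "2", "3", "4", "5", "6", "7"}:
            p = math.exp(alt.logprob)
            weighted += int(token) * p
            total += p
    return weighted / total if total > 0 else float("nan")

# Example: familiarity_estimate("berry") returns a value on the 1-7 scale.
```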
Table 2 shows the correlations between the various sets of human ratings and the GPT estimates. The first column shows the Pearson correlations; the second column shows correlation coefficients based on nonlinear regression, derived from the best-fitting curve when restricted cubic splines with four knots are used (Harrell, 2024).
Table 2 . Correlations between the human ratings and ChatGPT4 estimates. First coefficient = Pearson correlation; second coefficient = correlations on the basis of nonlinear regression
| Dataset | Pearson | Nonlinear |
|---|---|---|
| MRC (N = 4,913) | .86 | .87 |
| Hoosier (N = 19,165) | .79 | .83 |
| Bristol (N = 1450) | .66 | .72 |
| Compound (N = 629) | .81 | .83 |
| Glasgow (N = 5546) | .76 | .77 |
The correlations of the human ratings with the GPT estimates were similar to the correlations between the human ratings themselves. This was not expected beforehand since LLMs are trained on many more language materials than undergraduates. Neural networks like LLMs are extremely good at picking up statistical relationships between inputs and outputs, but the network’s outcome approaches that of humans only if the training material is the same. If the GPT training was based exclusively on material from Google Books or on Wikipedia, we would expect the estimates to be less correlated with human assessments.
To further examine the relationships between LLM estimates and human ratings, we ran a multiple regression on the Glasgow data. Other predictors were SUBTLEX-UK word frequency, Multilex word frequency, age of acquisition (AoA), and word prevalence. SUBTLEX-UK is a subtitle-based frequency measure based on BBC programs for a total corpus of 201 million words (van Heuven et al., 2014). Multilex is a new frequency measure obtained by combining the Twitter, Blog, and News frequencies of Gimenes and New (2016) with the subtitle frequencies of van Paridon and Thompson (2021). Combining convergent corpora usually leads to a better frequency measure than using a single source (Brysbaert & New, 2009). The total corpus size was 36 million words (Blogs) + 28 million (Twitter) + 32 million (News) + 751 million (Subtitles) = 847 million words. We simply summed the frequencies and recalculated them to Zipf values, as was done in SUBTLEX-UK (van Heuven et al., 2014). The Zipf scale is a standardized logarithmic scale defined as log10 frequency per billion words (with a value of 1 representing words that occur only 10 times per billion words and a value of 7 representing words that occur more than 10,000 times per million words). Because the subtitle corpus is larger than the other three corpora, the Multilex frequencies are strongly dominated by spoken language. This is a good thing, as subtitle frequencies are the best single source for word frequency estimates. For the same reason, the 2-billion-word Wikipedia corpus of van Paridon and Thompson was not included, because it reduced the correlations with word processing times. The AoA measure came from the same Glasgow norms as the familiarity ratings (Scott et al., 2019) and indicates the age at which people believe they acquired the words. Word prevalence came from Brysbaert et al. (2019) and indicates how many people say they know the word in a yes/no vocabulary test (probit transformed). Nonlinear regression was used to allow for different types of monotonic curves. Only words for which all observations were available were included in the analysis (N = 4515). Table 3 first shows the correlations between the various variables.
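As an illustration of the Zipf conversion just described, the following sketch sums raw counts across the component corpora and converts them to Zipf values. The counts are invented, and the +1 smoothing is our simplification of the correction described by van Heuven et al. (2014).

```python
import math

# Corpus sizes (in words) of the component corpora combined into Multilex (see text).
CORPUS_SIZES = {"blogs": 36e6, "twitter": 28e6, "news": 32e6, "subtitles": 751e6}
TOTAL_WORDS = sum(CORPUS_SIZES.values())  # roughly 847 million words

def zipf(count: int, corpus_size: float = TOTAL_WORDS) -> float:
    """Zipf value = log10 of frequency per billion words,
    i.e. log10(frequency per million words) + 3.
    A +1 Laplace correction (an assumption here) keeps unseen words finite."""
    per_million = (count + 1) / (corpus_size / 1e6)
    return math.log10(per_million) + 3

# Summing raw counts across corpora for a hypothetical word and converting:
counts = {"blogs": 120, "twitter": 85, "news": 95, "subtitles": 2400}
print(round(zipf(sum(counts.values())), 2))  # combined Multilex-style Zipf value
```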
Table 3 . Pearson correlations between the variables in the regression analysis (FAM = the Glasgow familiarity ratings)
|  | SUBTLEX-UK | Multilex | AoA | Prevalence | GPT_Fam |
|---|---|---|---|---|---|
| FAM | .68 | .67 | –.67 | .52 | .80 |
| SUBTLEX-UK |  | .92 | –.62 | .44 | .67 |
| Multilex |  |  | –.60 | .48 | .69 |
| AoA |  |  |  | –.34 | –.57 |
| Prevalence |  |  |  |  | .65 |
Multiple regression analysis indicated that GPT familiarity estimates were influenced not only by human familiarity ratings but also by word frequency (low-frequency words received lower GPT estimates; Multilex was used because it yielded higher values than SUBTLEX) and word prevalence (words known to many people had higher GPT estimates). AoA had no significant effect once the other variables were taken into account.
The impacts of word frequency and word prevalence are particularly interesting because they suggest that GPT estimates of familiarity may be better estimates of word difficulty than human ratings. Indeed, word frequency and prevalence are strong predictors of word processing efficiency. The usefulness of GPT estimates of familiarity for predicting word processing efficiency is examined in the next section.
How well do LLM familiarity estimates predict word processing efficiency?
Another way to examine the validity of LLM-based familiarity estimates is to see how well they predict word processing efficiency. In particular, we want to know whether the familiarity estimates correlate as well with word processing indices as the available word frequency measures.
There are four large-scale English datasets available to answer the question. The first dataset is the English Lexicon Project (ELP; Balota et al., 2007), already referred to. It provides lexical decision and naming data for 39,315 words. Both tasks provide measures of accuracy (% correct) and speed (in milliseconds). The authors recommend using the z-transformed RTs. Participants were American university students.
The second dataset is the British Lexicon Project (BLP; Keuleers et al., 2012). It contains lexical decision data (accuracy and speed) for 28,593 monosyllabic and disyllabic words. Participants were British university students.
The third dataset is the English Crowdsourcing Project (ECP; Mandera et al., 2020). It contains lexical decision data (accuracy and speed) for 61,851 words from an online vocabulary test taken by 220,000 people from the entire English-speaking population.
The fourth dataset is the English Auditory Lexicon Project (AELP; Goh et al., 2020). It contains lexical decision data for 10,133 words that were presented in spoken form (in the previous databases, the words were presented visually on a computer screen). Each word was judged by three groups of students: from Singapore, the United States and the United Kingdom. Because there were no major differences between the groups, we discuss only the average data.2
For each dependent variable, we had four predictors of interest: (1) SUBTLEX word frequency (UK for the British Lexicon Project, US for the other datasets), (2) Multilex word frequency, (3) UCD-SD frequencies, and (4) GPT-Fam estimates. The UCD-SD frequencies are from Chang et al. (2023). The UCD-SD measure stands for user contextual diversity modified by the semantic distinctiveness model. It is based on the number of Reddit entries in which a word appears and is considered by Chang et al. (2023) to be currently the best estimate of burstiness-adjusted word frequency.
Note that the sizes of these databases provide a way to address the concern that the high correlation between GPT_Fam estimates and human ratings is a result of leakage and contamination (Trott, 2024a). If the correlation were due to human ratings being part of the training data, GPT-4 would no longer perform well for words that were never rated by humans: we would see an overall decrease in the usefulness of the variable and a difference between the words rated by humans and the other words.
We report the percentage of variance explained by each predictor in a nonlinear regression (restricted cubic splines with four knots). For RT analyses, only words with more than 85% recognition were included because RTs are misleading for words not recognized by most participants. The results of the different analyses are shown in Table 4.
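The analyses themselves were run in R (the code is available on the OSF site, see below); the following Python sketch merely illustrates the logic of the variance-explained computation. The input file and column names are hypothetical, and we use patsy's natural cubic regression splines (cr with three degrees of freedom), which closely correspond to restricted cubic splines with four knots.

```python
import pandas as pd
import statsmodels.formula.api as smf  # statsmodels formulas accept patsy's cr() splines

# Hypothetical merged file: one row per word with the behavioural measures and predictors.
df = pd.read_csv("elp_with_predictors.csv")

# For RT analyses, keep only words recognized by more than 85% of participants.
rt_df = df[df["accuracy"] > 0.85]

def variance_explained(data: pd.DataFrame, dv: str, predictor: str) -> float:
    """R^2 of a nonlinear regression with a natural cubic spline basis (df = 3)."""
    return smf.ols(f"{dv} ~ cr({predictor}, df=3)", data=data).fit().rsquared

for predictor in ["subtlex_zipf", "multilex_zipf", "ucd_sd", "gpt_fam"]:
    print(predictor, round(variance_explained(rt_df, "zRT", predictor), 2))
```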
Table 4 . Percentage of variance explained by the various predictors in a nonlinear regression analysis. The best predictor is indicated in bold
|  | SUBTLEX | Multilex | UCD-SD | GPT_Fam |
|---|---|---|---|---|
| ELP LDT accuracy | .21 | .25 | .32 | **.47** |
| ELP LDT speed (zRT) | .43 | **.47** | .42 | .33 |
| ELP nam accuracy | .15 | .18 | .21 | **.47** |
| ELP nam speed (zRT) | .27 | **.31** | .30 | .23 |
| BLP LDT accuracy | .29 | .29 | .44 | **.48** |
| BLP LDT speed (RT) | .39 | .40 | .38 | **.44** |
| ECP LDT accuracy | .32 | .31 | .34 | **.67** |
| ECP LDT speed (RT) | .55 | **.58** | .47 | .57 |
| AELP LDT accuracy | .12 | .16 | .14 | **.34** |
| AELP LDT speed (RT) | .26 | **.28** | .22 | .18 |
A first notable finding in Table 4 is that GPT familiarity estimates are much better than the existing frequency measures at predicting accuracy. If we want to predict whether or not a word will be known, GPT familiarity information explains roughly 20 percentage points more variance than the best word frequency measure.
The situation is less convincing for response times. In three of the databases, Multilex frequency outperforms the others (including UCD-SD). GPT is only slightly better in the British Lexicon Project. One reason for the weak performance of GPT_Fam could be that other variables play a moderating role. For example, word length is known to have a significant influence on auditory word recognition. Word length also had a strong influence on the ELP data, because the nonwords became more like words as length increased (they differed from existing words by only one letter, independent of the length of the nonword).
To see whether word length could explain the differences in Table 4, we repeated the RT analyses with word length as an additional nonlinear variable (restricted cubic splines with four knots). For the visual tasks, word length was measured as the number of letters in the word; for AELP, it was defined as the number of phonemes.
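Continuing the earlier sketch, word length can be added as a second nonlinear term. Again, this is an illustration with hypothetical column names, not the authors' R code.

```python
import statsmodels.formula.api as smf

def variance_explained_with_length(data, dv: str, predictor: str,
                                   length_col: str = "n_letters") -> float:
    """R^2 when the predictor and word length both enter as spline terms.
    For AELP, length_col would hold the number of phonemes instead of letters."""
    formula = f"{dv} ~ cr({predictor}, df=3) + cr({length_col}, df=3)"
    return smf.ols(formula, data=data).fit().rsquared
```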
As can be seen in Table 5, word length is indeed an important variable to control for in lexical decision times, as the percentage of variance explained increases substantially (by up to 30%). Importantly for the present discussion, when the influence of word length is taken into account, the differences between Multilex and GPT_Fam become much smaller and even turn slightly in favor of the GPT familiarity estimates. The lower correlations between RTs and GPT familiarity in Table 4 are due to the fact that GPT familiarity estimates correlate less with word length than word frequency measures do. When words are matched on length, the GPT familiarity estimate is slightly more predictive of processing speed than the word frequency measures.
Table 5 . Percentage of variance explained by the various predictors in a nonlinear regression analysis when word length is taken into account as well. The best predictor is indicated in bold
|  | SUBTLEX | Multilex | UCD-SD | GPT_Fam |
|---|---|---|---|---|
| ELP LDT speed (zRT) | .63 | .67 | .67 | **.68** |
| ELP nam speed (zRT) | .43 | .46 | .48 | **.49** |
| BLP LDT speed (RT) | .43 | .43 | .45 | **.46** |
| ECP LDT speed (RT) | .65 | .65 | .56 | **.72** |
| AELP LDT speed (RT) | .46 | **.48** | .45 | .47 |
The good performance of GPT_Fam across all databases indicates that the AI-generated estimates are not simply a repetition of information present in the training material but generalize to new, unseen stimuli. The analysis of multi-word expressions below will provide more information on this.
Figures 1 and 2 illustrate the improvement in predictive power from SUBTLEX frequency to the GPT familiarity estimate for the English Crowdsourcing Project (admittedly the dataset with the largest improvement). Whereas the predicted accuracy ranges from .62 to .99 for SUBTLEX, it ranges from .25 to .99 for GPT (Fig. 1). For RTs, SUBTLEX only makes a distinction between words at the low end, whereas GPT distinguishes words across the entire range (Fig. 2).
Fig. 1 Progress in predictive power for response accuracy from SUBTLEX word frequency to GPT familiarity estimate (ECP dataset). Figures made with the visreg package (Breheny & Burchett, 2020)
Fig. 2 Progress in predictive power for response speed from SUBTLEX word frequency to GPT familiarity estimate (ECP dataset)
Why are GPT familiarity estimates better than word frequency counts?
As indicated earlier, we did not expect GPT word familiarity estimates to outperform existing word frequency measures in predicting word processing performance. The main reason was that the training material to which GPT was exposed is much richer than the material a student (the participant in most studies) has encountered. This raises the question of why GPT is doing so well. We hope that many colleagues will help elucidate the underlying factors, but at this time we offer the following speculations.
This is not the first time a neural network outperforms an earlier counting model. The same has happened with semantic vectors, where prediction-based vectors derived from networks outperform vectors based on counting co-occurrences (Mandera et al., 2017). Similarly, network models do better at predicting the next word in a sentence than n-gram models based on counting the previous few words (Frank, 2009; Xu & Rudnicky, 2000).
The main difference between counting and a neural network is that a neural network takes into account not only the number of occurrences of individual items but also synergies and competitions with other items. In a large language model, synergies are similarities in form and context. Words that are similar in form and occur in the same contexts (e.g., inflected words and transparent derived words) promote each other’s training in the network. In contrast, words with inconsistencies between form and context (e.g., homographs and opaque derived words) hinder each other’s learning.
An example of synergy was reported by Schreuder and Baayen (1997). They observed that processing times of monomorphemic words are faster when the words have many derived forms than when they have few, an effect the authors called the family size effect. Thus, a word like “berry” is recognized faster than the length- and frequency-matched word “skate” because the family size of “berry” (baneberry, barberry, berrylike, blackberry, ...) is much larger than the family size of “skate” (cheapskate, ice-skate, roller-skate, skateboard, skater). Also interesting is that the number of derivations seems to be more important than the frequencies of the derivations (i.e., type frequency rather than token frequency; Bertram et al., 2000).
The effect of family size on lexical decision time can be seen in the left part of Fig. 3. It shows the processing flow of monomorphemic words as a function of their frequency and family size. The words were the monomorphemic words and their family sizes listed in Brysbaert et al. (2016). The processing flow was based on the ECP database and defined as 6000 divided by lexical decision time. Its meaning is close to “bits processed per second.” The numerator 6000 was chosen because the scale then resembles the scale of the GPT familiarity estimates. Word frequency was the Multilex frequency expressed in Zipf scores. Family size was defined as log10 of the number of family members. Thus, a value of 0 means that there was only one family member (the word itself); a value of 1 indicates that there were ten family members.
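The two transformed variables plotted in Fig. 3 follow directly from the definitions above; a small sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical file: one row per monomorphemic word with its ECP lexical decision RT
# (in ms), its number of morphological family members, and its Multilex Zipf frequency.
words = pd.read_csv("monomorphemic_words.csv")

# Processing flow: 6000 / RT, so that the scale resembles the 1-7 GPT familiarity scale.
words["processing_flow"] = 6000 / words["ecp_rt_ms"]

# Family size on a log10 scale: 0 = only the word itself, 1 = ten family members,
# 1.3 = roughly twenty family members (the turning point discussed below).
words["log_family_size"] = np.log10(words["family_size"])
```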
Fig. 3 The left part shows the information processing flow of monomorphemic words (N = 15,363) as a function of their Multilex frequency and family size. It shows that the processing flow is higher for high-frequency words than for low-frequency words (Y-axis). At the same time, there is a smaller effect of family size (X-axis), in particular for low-frequency words. Processing flow improves as family size becomes larger up to a certain point (around 1.3, corresponding to a family size of 20) and then decreases again for higher family sizes. The right part of the figure shows that exactly the same pattern is present in the GPT familiarity estimates. The figure and analysis are based on the R package mgcv (Wood, 2023)
The left panel of Fig. 3 shows the strong effect of Multilex word frequency on information processing flow: More information per unit time is processed for high-frequency words than for low-frequency words. At the same time, there is a smaller effect of family size, especially for words with Multilex frequencies below 4 (corresponding to an occurrence of 10 times per million words). However, the pattern is somewhat more complex than Schreuder and Baayen (1997) thought. Initially, family size accelerates the processing flow, but this stops around family size 1.3 (corresponding to 20 family members). For low-frequency words with even higher family sizes, family members hurt rather than help (bottom right corner of the left panel). The two main examples of such words are “wort” and “berry,” which rarely occur as base words but often occur in compounds (butterwort, feverwort, milkwort, ..., blackberry, strawberry, ...).
Importantly, much the same pattern is seen in the GPT familiarity estimates (right side of Fig. 3). Where processing flow increases, GPT familiarity estimates increase; where processing flow is low, GPT familiarity is also low. This can be understood within the dynamics of an LLM. In such neural networks, words support each other when there is consistency in both form and context overlap, and compete with each other when there are inconsistencies in the mappings. The interactions are particularly important for low-frequency words, which are dominated by the higher-frequency ones. Bertram et al. (2000) observed the same network dynamics in human lexical decision times: Family members were especially helpful if they were transparent and similar in meaning to the stem word. Because such network dynamics are not captured by frequency counts, we believe this is one of the reasons why word frequency counts are at a disadvantage relative to GPT familiarity estimates in predicting word processing times.
Another example of inconsistency between form and context is found with fossil words. These are words that are obsolete but still occur in a fixed expression, such as “ado” in “without further ado” or “amok” in “run amok”. Fossil words by definition have a frequency equal to or greater than the frequency of the multi-word expressions they are part of, but they have long response times in lexical decision because participants do not expect to see these words outside their usual context. We took a list of 35 fossil words from Wikipedia (in August 2024) and compared the GPT familiarity estimates for these words to the GPT familiarity estimates for the full expressions. In line with the human data, GPT estimates of the individual words are lower (M = 3.03) than the estimates of the full expressions (M = 3.71; t(34) = – 3.94, d = – .67). This is another example where LLM familiarity estimates take the broader context into account and therefore outperform word frequency measures.
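The reported comparison is a paired t-test over the 35 word-expression pairs; a minimal sketch of the computation (the values in the example call are invented, not the actual estimates):

```python
import numpy as np
from scipy import stats

def paired_comparison(word_fam: np.ndarray, expression_fam: np.ndarray) -> None:
    """Paired t-test and Cohen's d for fossil words vs. the expressions containing them."""
    t, p = stats.ttest_rel(word_fam, expression_fam)
    diff = word_fam - expression_fam
    d = diff.mean() / diff.std(ddof=1)  # standardized mean of the paired differences
    print(f"t({len(diff) - 1}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")

# Illustrative call with invented values for three of the pairs:
paired_comparison(np.array([3.1, 2.8, 3.4]), np.array([3.9, 3.6, 3.7]))
```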
More generally, we can predict that word familiarity will be lower for all words with inconsistencies than for matched control words, because diverse contexts are an advantage only if they relate to the same meaning of a word. This differs from the context-dependent frequency approach (Chang et al., 2023; Johns & Jones, 2022), which assumes that context variety always leads to a processing advantage.
Another example where GPT gives a more sensible estimate than a simple frequency count is the word “creasy.” Brysbaert and New (2009) were surprised by the high frequency of the word in SUBTLEX-US (1.23 per million words or Zipf = 3.09). Further investigation showed that the high frequency of this word was due to a single movie with the character John Creasy. Contextual diversity corrected for the problem (because nearly all instances were limited to one context). GPT_Fam also corrects for the problem, because it penalizes inconsistency between word form and context. Indeed, the GPT familiarity index of the word creasy is only 1.23, reflecting the fact that the word is hardly used as an adjective in English.
Two other words with heavily overestimated word frequencies in SUBTLEX-US were don and haven (Zipf values of 4.2 and 3.88, respectively). This was caused by the fact that removing the punctuation marks from the corpus inadvertently resulted in recoding don’t as “don t” and haven’t as “haven t”. Because there is no overlap between the contexts in which the expressions don and don’t occur, GPT familiarity estimates are less affected by such confounding (the familiarity estimate of don is 4.7, against 7.0 for don’t; 5.7 for haven, against 7.0 for haven’t).
We can even go a step further and hypothesize that the ambiguous status of words in multi-word expressions is one of the reasons why word frequency is strongly correlated with word length, unlike LLM familiarity. The frequency count of the word “hand” is high (Multilex Zipf = 5.48), because it includes instances of the word hand in multi-word expressions, such as “bridge hand,” “first hand,” “hand in,” “hand over,” “hand in hand,” etc. In contrast, the frequency count of “secondhand” (Multilex Zipf = 2.99) is low, partly because it does not include the spellings “second-hand” or “second hand.” LLMs are much better at generalizing across spelling variants of long words that occur in similar contexts. The familiarity estimates of secondhand, second-hand and second hand are 6.4, 6.2 and 6.4, respectively (compared to 7.0 for hand).
Familiarity estimates for multi-word expressions
As described earlier, we collected and validated AI-generated familiarity estimates for words because we hoped they would be informative for multi-word expressions (MWEs). Indeed, an interesting aspect of LLM-based estimates is that they are not limited to isolated words. In modern LLMs, input no longer consists of words as units, but of concatenated strings of letters, punctuation marks, and spaces. As a result, there are no boundaries between words, and word sequences are processed in the same way as individual words. GPT is as willing to provide a familiarity estimate for “break the ice” (6.12/7) as for “socialize” (6.78/7).
The familiarity estimates for MWEs were collected in the same way as for words. Unfortunately, there is currently not much data that we can use to directly validate GPT estimates for multi-word expressions. The only large dataset available consists of concreteness ratings for 62,000 English MWEs, collected by Muraki et al. (2023). Martínez et al. (2025) found that GPT estimates for these MWEs correlated as strongly with the human ratings (r = .8) as did GPT estimates with concreteness ratings for 40,000 individual words collected by Brysbaert et al. (2014).
One source of information we can use for familiarity estimates is the number of participants indicating they did not know the stimulus well enough to provide a concreteness rating. Both Brysbaert et al. (2014) and Muraki et al. (2023) asked participants not to rate words/expressions they did not know well enough. Instead, participants were asked to write X for those entries.
Using this information, we can compare GPT familiarity estimates for words/expressions that people know versus words/expressions that they do not know. We present the data for a threshold of 90% known (other thresholds give similar results). What GPT estimates are given to words/expressions known by less than 90% of reviewers, and how do these estimates compare to the estimates of words/expressions that more than 90% of reviewers say they know? More importantly, do the distributions look the same for words and MWEs, as should be the case if the GPT estimates for MWEs behave like those for individual words?
The results are shown in Fig. 4. Although more multi-word expressions were known than words, the values within the distributions are very similar. GPT estimates below 3.5 are unlikely to be known to 90% of human raters; estimates above this threshold have a high probability of being known.
Fig. 4 Distributions of GPT familiarity scores for words/expressions known to less than 90% of the human raters (purple) and those known to 90% or more. Left side: single words; right side: multi-word expressions. The figures were made with the R package ggplot2 (Wickham & Chang, 2016)
It is also worth looking at the exceptions. What stimuli get a GPT score above 6 and yet are not known to the participants? For the multi-word expressions, examples are: “Jesus of Nazareth”, “what’s that”, “seven hundred”, “gas station”, “chicken wings”. These examples suggest that the problem lies with human reviewers rather than GPT estimates (to be fair, most of these expressions had human acceptance rates slightly below 90%). Examples of expressions known to human evaluators but unknown to GPT are: “do a moonie”, “gone north about”, “head-emptier”, “couple on”, “dotriacontanoic acid”, “ioxitalamic acid”, and “hentriacontanoic acid”. Again, one might wonder who is making the wisest decisions here.
Researchers can also use multi-word expressions in creative ways. For example, Scott et al. (2019) included 379 ambiguous words in their stimulus materials. These words were either presented alone (e.g., toast) or with information that selected an alternative meaning. Then they were written as “toast (bread)” and “toast (speech)”. These stimuli were not included in the word analyses of Table 3 because there are no word frequency measures for them. However, LLMs do provide scores for such multi-word expressions as easily as for isolated words, which we can compare with human ratings. Indeed, this is what Trott (2024b) did for three different features (concreteness, semantic similarity and valence), reporting high correlations (r > .75) between the human ratings and GPT-4 estimates.
The Pearson correlation between the human familiarity ratings and GPT estimates for the words without disambiguation was r = .76, slightly lower than the overall correlation because many of the words were well known (range restriction). When we specifically compared the values for the ambiguous and disambiguated words, we saw that both people and GPT on average gave higher values for the ambiguous words than for the disambiguated words (Fig. 5). This is because some of the disambiguated words are not commonly known. The 14 stimuli with the lowest familiarity ratings by humans were: over (cricket), duck (cricket), traction (medical), hawk (sell), steer (cows), quack (charlatan), hooker (rugby), sage (wise man), crook (stick), page (ward), lark (fun), pan (criticize), coke (coal), and bridal (horse). The 10 stimuli with the lowest GPT estimates were: head (semen), page (ward), couch (express), corn (foot), pine (long for), bridal (horse), habit (nun), bunk (truant), camp (effeminate), count (title). Both sets comprise word meanings unknown to many people. The comparable performance of people and GPT adds further evidence to the argument that GPT estimates can be used as a proxy for human ratings of familiarity.
Fig. 5 Ratings given by people (left panel) and GPT (right panel) for ambiguous words and their disambiguated meanings
Discussion
The present line of research started from the question whether large language models (LLMs) can be used to measure familiarity with multi-word expressions (MWEs). Frequency counts are difficult to compute for such expressions because the expressions can take many forms and are complicated to extract in a lemmatized manner.
Before we could argue that LLM familiarity estimates are useful for MWEs, we had to establish that they work for individual words. There are five large datasets of human familiarity ratings that could be used as validation criteria. It was reassuring to see that the correlations between GPT4 familiarity estimates and human ratings were equal to the intercorrelations of the human datasets themselves (Table 2), indicating that LLM estimates can be used as an approximation for human ratings.
Further analysis revealed that the LLM estimates were more influenced by word frequency and word prevalence than the human ratings were. This opened up the possibility that the familiarity estimates would be good predictors of word processing efficiency in megastudies. Indeed, this is what we found. LLM familiarity estimates were a better predictor of word accuracy, and they performed as well as or even slightly better than the best available word frequency measures in predicting word processing speed when word length was taken into account (Tables 4 and 5). This was better than we expected and revives the old debate on the merits of word frequency versus word familiarity (Cortese & Khanna, 2007; Gernsbacher, 1984; Haagen, 1949).
The good prediction of performance in word processing megastudies is further important because it shows that GPT estimates are not limited to words that may have been part of the training material but also generalize to the 40,000 words that have never been rated by humans (see the issue of leakage and contamination, discussed in the introduction). The estimates even generalize to expressions consisting of multiple words, as demonstrated in the current paper and in Martínez et al. (2025). This makes it much easier to conduct research on such expressions.
The good performance of GPT_Fam estimates even raises the question of whether they should be preferred over word frequency measures, considering the cautionary remarks about AI-based judgments expressed by Atari et al. (2023), Dillion et al. (2023), Grieve et al. (2024), Messeri and Crockett (2024), and Trott (2024a). In this respect, we endorse the conclusion reached by the previous authors: neither vilification nor deification of AI-generated estimates is likely to help us much. Given their good performance even for words and expressions not previously evaluated, GPT_Fam estimates offer an interesting option for stimulus selection and control because the information is available in large quantities and is predictive of human performance.
At the same time, it is good to keep in mind that human ratings are the gold standard and that collecting such ratings is strongly recommended if familiarity is the central variable of theory-driven research. In that case, it is important to collect new, uncontaminated information, especially since the amount of nonhuman-produced text is growing rapidly. In our analysis, we saw little reason for concern about leakage and contamination from training materials, but that may no longer be the case once large-scale databases such as the current one become available online. Then, the training materials may start to have a stronger impact on the estimates given by the algorithms. In this respect, we may have been lucky and worked at exactly the right time, when LLMs became powerful enough to be good and not yet contaminated enough to deviate from the language that humans use and produce.
Newly collected human ratings will always be the best test for AI-generated estimates because you can be sure that the ratings have not yet been used in the training material for the algorithm (Trott, 2024a). Another way forward may become available when research-based LLMs catch up with ChatGPT4 (which is likely in the coming years), as this will give researchers more information about and control over the training regime and the input given to the model.
Despite calls for caution, which we endorse, our findings indicate that AI-based estimates of word features have great pragmatic utility for stimulus selection and control. They even raise interesting theoretical questions. One is whether we should revisit the word prediction effect in text reading. Recent articles have shown that fixation duration on words is affected by the predictability of words within the prior context (also called surprisal), even when controlling for word frequency (Brothers & Kuperberg, 2021; Cevoli et al., 2022; De Varda & Marelli, 2022; Heilbron et al., 2023; Frank, 2009; Wilcox et al., 2023). Our findings strongly suggest that these analyses should be done again using word familiarity as a control rather than word frequency. Is there still a surprisal effect when controlling for context-free familiarity in a network and for word length?
Also, the relative importance of word length and word frequency in word processing will need re-evaluation (Barton et al., 2014; Carter & Luke, 2019; Hauk & Pulvermüller, 2004; Hudson & Bergman, 1985; Juhasz & Rayner, 2003; Kliegl et al., 2004; Koplenig et al., 2022; Kuperman et al., 2024; Meylan & Griffiths, 2024). The dominant view now is that word frequency is more important than word length. However, due to the negative correlation between frequency and length (e.g., Pearson correlation of – 0.43 between word length and Multilex word frequency in ELP, versus – 0.14 with GPT familiarity), authors may have underestimated the impact of word length. If word frequency counts are biased towards short words (for reasons we described earlier), the word frequency effect may be partially a word length effect in disguise.
Availability of the LLM familiarity estimates
An advantage of LLM estimates is that, like word frequencies, they can be generated for large lists of words. To avoid everyone having to compile their own list (and to provide a timestamped version that can be shared), we have compiled a list of 417,118 entries (263,780 words, 40,468 hyphenated words, and 112,870 MWEs) for which we collected GPT4o familiarity estimates. The list is available at https://osf.io/c2yef/.
The list was constructed from the various word lists used in megastudies and crowdsourcing studies, from previous familiarity databases, and from existing word frequency lists, including recent lists of child frequencies (Green et al., 2024; Korochkina et al., 2024). A number of these entries are faulty due to misspellings or omitted spaces, but they provide readers with a good basis for testing all types of stimuli.
For most applications, the database can be cleaned up based on the GPT familiarity estimates. As we saw in Fig. 4, stimuli with a familiarity lower than 3.5 are unlikely to be of research interest, except for learning experiments that require unknown stimuli. Therefore, we also provide a cleaned-up list with familiarity ratings above 3.5, which limited the list to 65,304 words, 19,783 hyphenated words, and 61,805 MWEs. The lists also contain the Multilex frequencies of the words.
Even better, it is easy to obtain LLM familiarity estimates for stimuli not in our list. All you have to do is use the prompt we used and submit a request using OpenAI’s application programming interface to get the probabilities of the different values and compute the familiarity estimate. If the query is done with GPT4o, you should get numbers close to those we obtained (depending on further developments of the model). Queries with other state-of-the-art LLM models should also give close values (correlations of estimates from other models were up to .1 lower than with GPT-4o but are likely to improve in the near future). The best way to proceed is probably to try a few stimuli from our list first and query the new stimuli once you know you are getting the right output.
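For example, reusing the familiarity_estimate() helper sketched earlier (a hypothetical function name, not released code), new stimuli of any type, including MWEs, can be queried in a loop:

```python
# Assumes the familiarity_estimate() sketch from the section on obtaining LLM estimates.
for item in ["berry", "doom loop", "betwixt and between"]:
    print(item, round(familiarity_estimate(item), 2))
```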
Authors’ contributions
All authors have contributed to the ideas tested in the paper (and others that did not make the end report). The tests were run by the authors from Madrid. Ghent was ultimately responsible for statistical analysis and writing.
Funding
This research was supported by the FUN4DATE (PID2022-136684OB-C21/C22) and ENTRUDIT (TED2021-130118B-I00) projects funded by the Spanish Agencia Estatal de Investigacion (AEI) https://doi.org/10.13039/501100011033 and by the OpenAI research access program, which provided access to ChatGPT-4o on a non-commercial basis.
Availability of data and materials
All data and materials are available at https://osf.io/c2yef/. This includes the list of 417,118 words/expressions with GPT4 familiarity estimates and Multilex frequencies, and the cleaned list of 146,892 words/expressions with GPT estimates above 3.5 and Multilex word frequencies.
Code availability
The R code used for the analyses is also available at https://osf.io/c2yef/.
Declarations
Ethics approval
The studies did not involve people and followed the General Ethical Protocol of the Faculty of Psychology and Educational Sciences at Ghent University. Therefore, they need no explicit approval from the Faculty.
Consent to participate
The studies did not involve new data collection from people.
Consent for publication
All authors consent.
Conflict of interest
The authors ran the studies independently and do not expect any financial gain from them.
1 Instructions inspired by the Hoosier norms resulted in correlations that were about .05 lower. Most words received high familiarity estimates according to these instructions. See the OSF project site for the specific prompt that was used.
2 All data are on the OSF site, for readers dying to look at regional differences.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Adelman, JS; Brown, GD; Quesada, JF. Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science; 2006; 17,
Atari, M., Xue, M. J., Park, P. S., Blasi, D., & Henrich, J. (2023). Which humans? Available at https://osf.io/preprints/psyarxiv/5b26t. Accessed 14 Dec 2024
Baayen, RH; Milin, P; Ramscar, M. Frequency in lexical processing. Aphasiology; 2016; 30,
Balota, DA; Pilotti, M; Cortese, MJ. Subjective frequency estimates for 2,938 monosyllabic words. Memory & Cognition; 2001; 29, pp. 639-647. [DOI: https://dx.doi.org/10.3758/BF03200465]
Balota, DA; Yap, MJ; Cortese, MJ; Hutchison, KA; Kessler, B; Loftis, B et al. The English Lexicon Project. Behavior Research Methods; 2007; 39, pp. 445-459. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/17958156][DOI: https://dx.doi.org/10.3758/BF03193014]
Barton, JJ; Hanif, HM; Eklinder Björnström, L; Hills, C. The word-length effect in reading: A review. Cognitive Neuropsychology; 2014; 31,
Bertram, R; Schreuder, R; Baayen, RH. The balance of storage and computation in morphological processing: The role of word formation type, affixal homonymy, and productivity. Journal of Experimental Psychology. Learning, Memory, and Cognition; 2000; 26,
Breheny, P., & Burchett, W. (2020). Package ‘visreg’ Version 2.7.0. Available at http://r.meteo.uni.wroc.pl/web/packages/visreg/visreg.pdf. Accessed 14 Dec 2024
Brothers, T; Kuperberg, GR. Word predictability effects are linear, not logarithmic: Implications for probabilistic models of sentence comprehension. Journal of Memory and Language; 2021; 116, [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33100508][DOI: https://dx.doi.org/10.1016/j.jml.2020.104174] 104174.
Brysbaert, M; Cortese, MJ. Do the effects of subjective frequency and age of acquisition survive better word frequency norms?. Quarterly Journal of Experimental Psychology; 2011; 64,
Brysbaert, M; New, B. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods; 2009; 41,
Brysbaert, M; Buchmeier, M; Conrad, M; Jacobs, AM; Bölte, J; Böhl, A. The word frequency effect. Experimental Psychology; 2011; 58,
Brysbaert, M; Keuleers, E; New, B. Assessing the usefulness of Google Books’ word frequencies for psycholinguistic research on word processing. Frontiers in Psychology; 2011; 2, 27. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21713191][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111095][DOI: https://dx.doi.org/10.3389/fpsyg.2011.00027]
Brysbaert, M; Warriner, AB; Kuperman, V. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods; 2014; 46, pp. 904-911. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24142837][DOI: https://dx.doi.org/10.3758/s13428-013-0403-5]
Brysbaert, M; Stevens, M; Mandera, P; Keuleers, E. How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Frontiers in Psychology; 2016; 7, 1116. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27524974][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965448][DOI: https://dx.doi.org/10.3389/fpsyg.2016.01116]
Brysbaert, M; Mandera, P; Keuleers, E. The word frequency effect in word processing: An updated review. Current Directions in Psychological Science; 2018; 27,
Brysbaert, M; Mandera, P; McCormick, SF; Keuleers, E. Word prevalence norms for 62,000 English lemmas. Behavior Research Methods; 2019; 51, pp. 467-479. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29967979][DOI: https://dx.doi.org/10.3758/s13428-018-1077-9]
Burgess, C; Livesay, K. The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, Instruments, & Computers; 1998; 30, pp. 272-277. [DOI: https://dx.doi.org/10.3758/BF03200655]
Carroll, JB. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour; 1970; 3,
Carter, BT; Luke, SG. The effect of convolving word length, word frequency, function word predictability and first pass reading time in the analysis of a fixation-related fMRI dataset. Data in Brief; 2019; 25, [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31463340][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6706769][DOI: https://dx.doi.org/10.1016/j.dib.2019.104171] 104171.
Cevoli, B; Watkins, C; Rastle, K. Prediction as a basis for skilled reading: Insights from modern language models. Royal Society Open Science; 2022; 9,
Chang, M; Jones, MN; Johns, BT. Comparing word frequency, semantic diversity, and semantic distinctiveness in lexical organization. Journal of Experimental Psychology. General; 2023; 152,
Chen, X; Dong, Y. Evaluating objective and subjective frequency measures in L2 lexical processing. Lingua; 2019; 230, [DOI: https://dx.doi.org/10.1016/j.lingua.2019.102738] 102738.
Clark, JM; Paivio, A. Extensions of the Paivio, Yuille, and Madigan (1968) norms. Behavior Research Methods, Instruments, & Computers; 2004; 36,
Coltheart, M. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A; 1981; 33,
Cortese, MJ; Khanna, MM. Age of acquisition predicts naming and lexical-decision performance above and beyond 22 other predictor variables: An analysis of 2,342 words. Quarterly Journal of Experimental Psychology; 2007; 60,
De Varda, A., & Marelli, M. (2022, November). The effects of surprisal across languages: Results from native and non-native reading. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022 (pp. 138–144).
Dillion, D; Tandon, N; Gu, Y; Gray, K. Can AI language models replace human participants?. Trends in Cognitive Sciences; 2023; 27,
Frank, S. (2009). Surprisal-based comparison between a symbolic and a connectionist model of sentence processing. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 31, No. 31).
Gao, C; Shinkareva, SV; Desai, RH. SCOPE: The South Carolina Psycholinguistic Metabase. Behavior Research Methods; 2023; 55,
Gernsbacher, MA. Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology: General; 1984; 113,
Gilhooly, KJ; Logie, RH. Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation; 1980; 12, pp. 395-427. [DOI: https://dx.doi.org/10.3758/BF03201693]
Gimenes, M; New, B. Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods; 2016; 48, pp. 963-972. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26170053][DOI: https://dx.doi.org/10.3758/s13428-015-0621-0]
Goh, WD; Yap, MJ; Chee, QW. The Auditory English Lexicon Project: A multi-talker, multi-region psycholinguistic database of 10,170 spoken words and nonwords. Behavior Research Methods; 2020; 52,
Green, C; Keogh, K; Sun, H; O’Brien, B. The Children’s Picture Books Lexicon (CPB-LEX): A large-scale lexical database from children’s picture books. Behavior Research Methods; 2024; 56, pp. 4504-2024. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37566336][DOI: https://dx.doi.org/10.3758/s13428-023-02198-y]
Gries, ST. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics; 2008; 13,
Grieve, J., Bartl, S., Fuoli, M., Grafmiller, J., Huang, W., Jawerbaum, A., .. & Winter, B. (2024). The Sociolinguistic Foundations of Language Modeling. arXiv preprint arXiv:2407.09241
Haagen, CH. Synonymity, vividness, familiarity, and association value ratings of 400 pairs of common adjectives. The Journal of Psychology; 1949; 27,
Harrell, F. E., Jr. (2024). Package ‘rms’ Version 6.8–2. Available at https://cran.r-project.org/web/packages/rms/rms.pdf. Accessed 14 Dec 2024
Hauk, O; Pulvermüller, F. Effects of word length and frequency on the human event-related potential. Clinical Neurophysiology; 2004; 115,
Heilbron, M; van Haren, J; Hagoort, P; de Lange, FP. Lexical processing strongly affects reading times but not skipping during natural reading. Open Mind; 2023; 7, pp. 757-783. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37840763][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10575561][DOI: https://dx.doi.org/10.1162/opmi_a_00099]
Hills, J., & Anadkat, S. (2023). Using logprobs. Available at https://cookbook.openai.com/examples/using_logprobs. Accessed 14 Dec 2024
Hudson, PT; Bergman, MW. Lexical knowledge in word recognition: Word length and word frequency in naming and lexical decision tasks. Journal of Memory and Language; 1985; 24,
Johns, B. T., & Jones, M. N. (2022). Content matters: Measures of contextual diversity must consider semantic content. Journal of Memory and Language, 123, Article 104313. https://doi.org/10.1016/j.jml.2021.104313
Juhasz, BJ; Rayner, K. Investigating the effects of a set of intercorrelated variables on eye fixation durations in reading. Journal of Experimental Psychology: Learning, Memory, and Cognition; 2003; 29,
Juhasz, B. J., Lai, Y. H., & Woodcock, M. L. (2015). A database of 629 English compound words: ratings of familiarity, lexeme meaning dominance, semantic transparency, age of acquisition, imageability, and sensory experience. Behavior Research Methods, 47, 1004–1019.
Keuleers, E; Lacey, P; Rastle, K; Brysbaert, M. The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods; 2012; 44, pp. 287-304. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21720920][DOI: https://dx.doi.org/10.3758/s13428-011-0118-4]
Kliegl, R; Grabner, E; Rolfs, M; Engbert, R. Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology; 2004; 16,
Koplenig, A; Kupietz, M; Wolfer, S. Testing the relationship between word length, frequency, and predictability based on the German reference corpus. Cognitive Science; 2022; 46,
Korochkina, M; Marelli, M; Brysbaert, M; Rastle, K. The Children and Young People's Books Lexicon (CYP-LEX): A large-scale lexical database of books read by children and young people in the United Kingdom. Quarterly Journal of Experimental Psychology; 2024; 17470218241229694.
Kuperman, V; Schroeder, S; Gnetov, D. Word length and frequency effects on text reading are highly similar in 12 alphabetic languages. Journal of Memory and Language; 2024; 135, [DOI: https://dx.doi.org/10.1016/j.jml.2023.104497] 104497.
Mandera, P; Keuleers, E; Brysbaert, M. Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language; 2017; 92, pp. 57-78. [DOI: https://dx.doi.org/10.1016/j.jml.2016.04.001]
Mandera, P; Keuleers, E; Brysbaert, M. Recognition times for 62 thousand English words: Data from the English Crowdsourcing Project. Behavior Research Methods; 2020; 52, pp. 741-760. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31368025][DOI: https://dx.doi.org/10.3758/s13428-019-01272-8]
Martínez, G., Conde, J., Reviriego, P., Merino-Gómez, E., Hernández, J. A., & Lombardi, F. (2023). How many words does ChatGPT know? The answer is ChatWords. arXiv preprint arXiv:2309.16777
Martínez, G., Molero, J. D., González, S., Conde, J., Brysbaert, M., & Reviriego, P. (2025). Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousal. Behavior Research Methods, 57(5), 1–11.
Messeri, L; Crockett, MJ. Artificial intelligence and illusions of understanding in scientific research. Nature; 2024; 627,
Meylan, SC; Griffiths, TL. Word forms reflect trade-offs between speaker effort and robust listener recognition. Cognitive Science; 2024; 48,
Muraki, EJ; Abdalla, S; Brysbaert, M; Pexman, PM. Concreteness ratings for 62,000 English multi-word expressions. Behavior Research Methods; 2023; 55,
Nusbaum, H. C., Pisoni, D. B., & Davis, C. K. (1984). Sizing up the Hoosier mental lexicon. Research on spoken language processing report, 10.
Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery and meaningfulness values for 925 words. Journal of Experimental Psychology Monograph Supplement, 76 (3, part 2).
Preston, KA. The speed of word perception and its relation to reading ability. The Journal of General Psychology; 1935; 13,
Schreuder, R; Baayen, RH. How complex simplex words can be. Journal of Memory and Language; 1997; 37,
Scott, GG; Keitel, A; Becirspahic, M; Yao, B; Sereno, SC. The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods; 2019; 51, pp. 1258-1270. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30206797][DOI: https://dx.doi.org/10.3758/s13428-018-1099-3]
Stadthagen-Gonzalez, H; Davis, CJ. The Bristol norms for age of acquisition, imageability, and familiarity. Behavior Research Methods; 2006; 38,
Thorndike, E. L. (1931). A teacher's word book of twenty thousand words. Columbia University.
Trott, S. Can large language models help augment English psycholinguistic datasets?. Behavior Research Methods; 2024; 56, pp. 6082-6100. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38261264][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11335796][DOI: https://dx.doi.org/10.3758/s13428-024-02337-z]
Trott, S. Large language models and the wisdom of small crowds. Open Mind; 2024; 8, pp. 723-738. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38828431][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11142632][DOI: https://dx.doi.org/10.1162/opmi_a_00144]
Vaden, K. I., Halpin, H. R., & Hickok, G. S. (2009). Irvine phonotactic online dictionary, Version 2.0. [Data file]. Available from https://www.iphod.com. Accessed 14 Dec 2024
Van Heuven, WJ; Mandera, P; Keuleers, E; Brysbaert, M. SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology; 2014; 67,
Van Paridon, J; Thompson, B. subs2vec: Word embeddings from subtitles in 55 languages. Behavior Research Methods; 2021; 53,
Westbury, C. You can’t drink a word: Lexical and individual emotionality affect subjective familiarity judgments. Journal of Psycholinguistic Research; 2014; 43, pp. 631-649. [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24061785][DOI: https://dx.doi.org/10.1007/s10936-013-9266-2]
Wickham, H., & Chang, W. (2016). Package ‘ggplot2’. Create elegant data visualisations using the grammar of graphics. Version 1.9.1. Available at https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=af53fd2f5b9e81b6edec0c13e1b3babd34bda399. Accessed 14 Dec 2024
Wilcox, EG; Pimentel, T; Meister, C; Cotterell, R; Levy, RP. Testing the predictions of surprisal theory in 11 languages. Transactions of the Association for Computational Linguistics; 2023; 11, pp. 1451-1470. [DOI: https://dx.doi.org/10.1162/tacl_a_00612]
Winter, B; Lupyan, G; Perry, LK; Dingemanse, M; Perlman, M. Iconicity ratings for 14,000+ English words. Behavior Research Methods; 2024; 56,
Wood, S. (2023). Package ‘mgcv’. R package Version 1.9–1. Available at https://cran.r-project.org/web/packages/mgcv/mgcv.pdf. Accessed 14 Dec 2024
Wu, T; He, S; Liu, J; Sun, S; Liu, K; Han, QL; Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica; 2023; 10,
Xu, W., & Rudnicky, A. (2000). Can artificial neural networks learn language models? Available at https://kilthub.cmu.edu/articles/journal_contribution/Can_Artificial_Neural_Networks_Learn_Language_Models_/6604016/1/files/12094409.pdf. Accessed 14 Dec 2024