Published online: 25 March 2014
© Psychonomic Society, Inc. 2014
Abstract The frequency distribution of words has been a key object of study in statistical linguistics for the past 70 years. This distribution approximately follows a simple mathematical form known as Zipf's law. This article first shows that human language has a highly complex, reliable structure in the frequency distribution over and above this classic law, although prior data visualization methods have obscured this fact. A number of empirical phenomena related to word frequencies are then reviewed. These facts are chosen to be informative about the mechanisms giving rise to Zipf's law and are then used to evaluate many of the theoretical explanations of Zipf's law in language. No prior account straightforwardly explains all the basic facts or is supported with independent evaluation of its underlying assumptions. To make progress at understanding why language obeys Zipf's law, studies must seek evidence beyond the law itself, testing assumptions and evaluating novel predictions with new, independent data.
Keywords Language · Zipf's law · Statistics
Introduction
One of the most puzzling facts about human language is also one of the most basic: Words occur according to a famously systematic frequency distribution such that there are few very high-frequency words that account for most of the tokens in text (e.g., "a," "the," "I," etc.) and many low-frequency words (e.g., "accordion," "catamaran," "ravioli"). What is striking is that the distribution is mathematically simple, roughly obeying a power law known as Zipf's law: The rth most frequent word has a frequency f(r) that scales according to
f(r) ∝ 1/r^α (1)
for α ≈ 1 (Zipf, 1936, 1949).¹ In this equation, r is called the frequency rank of a word, and f(r) is its frequency in a natural corpus. Since the actual observed frequency will depend on the size of the corpus examined, this law states frequencies proportionally: The most frequent word (r = 1) has a frequency proportional to 1, the second most frequent word (r = 2) has a frequency proportional to 1/2^α, the third most frequent word has a frequency proportional to 1/3^α, and so forth.
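The proportional statement of the law can be made concrete with a short sketch. The following is illustrative only: the toy "corpus" and whitespace tokenization are assumptions for the example, not part of the original analysis, which used large natural corpora.

```python
from collections import Counter

# A toy corpus; any natural text would serve (this one is purely illustrative).
text = ("the cat saw the dog and the dog saw the cat "
        "and the bird saw the cat and the dog")
counts = Counter(text.split())

# Rank words by descending frequency: rank r = 1 is the most frequent word.
ranked = counts.most_common()
f1 = ranked[0][1]  # frequency of the top-ranked word

# Under Zipf's law with alpha = 1, f(r) should be roughly f(1) / r:
# the second-ranked word about half as frequent as the first, and so on.
for rank, (word, freq) in enumerate(ranked, start=1):
    predicted = f1 / rank
    print(f"r={rank}  word={word!r}  observed f={freq}  Zipf prediction={predicted:.1f}")
```

On a corpus this small the fit is of course crude; the point is only the shape of the computation, which scales directly to real text.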
Mandelbrot proposed and derived a generalization of this law that more closely fits the frequency distribution in language by "shifting"...