Mathematical Formulation
Zipf’s Law can be understood intuitively: in any language, a few extremely common words (e.g., “the,” “of,” “and”) account for a large share of all usage, while the vast majority of words occur only rarely. Word frequencies therefore follow a power-law distribution, in which the frequency of a word is proportional to its rank raised to a negative power.
Mathematically, Zipf’s Law can be expressed as:
[Tex]f(r) = \frac{C}{r^s} [/Tex]
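where f(r) is the frequency of the word at rank r (with r = 1 for the most frequent word), C is a constant that depends on the corpus, and s is the Zipf exponent, which is close to 1 for many natural-language corpora.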
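To make the formula concrete, here is a minimal sketch that evaluates f(r) = C / r^s for the first few ranks. The function name `zipf_frequency` and the values C = 60000 and s = 1.0 are illustrative assumptions, not figures taken from a real corpus.

```python
def zipf_frequency(rank, C=1.0, s=1.0):
    """Predicted frequency of the word at a given rank under Zipf's Law.

    rank: position in the frequency-sorted list (1 = most frequent word)
    C:    corpus-dependent constant (hypothetical value in the example below)
    s:    Zipf exponent
    """
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return C / rank ** s


# Hypothetical constant: assume the top-ranked word occurs 60000 times.
predicted = [zipf_frequency(r, C=60000, s=1.0) for r in range(1, 6)]

for r, f in enumerate(predicted, start=1):
    print(f"rank {r}: predicted frequency {f:.0f}")
```

With s = 1, the word at rank 2 is predicted to occur half as often as the word at rank 1, the word at rank 3 one third as often, and so on.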
Key concepts and terms:
- Zipf exponent: The exponent s in the Zipf’s Law equation determines the steepness of the frequency distribution curve. It reflects the degree of inequality in word frequencies: the larger s is, the faster frequency falls off with rank.
- Rank-frequency distribution: A plot showing the relationship between the rank of words in a language and their frequency of occurrence. On log-log axes, a Zipfian distribution appears as a roughly straight line with slope -s.
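The two terms above can be tied together in code. The sketch below builds a rank-frequency list from a sequence of tokens and then estimates the Zipf exponent by fitting a straight line to log(frequency) against log(rank), using log f(r) = log C - s log r. The helper names `rank_frequency` and `estimate_zipf_exponent` are hypothetical, and ordinary least squares on the log-log data is only one simple estimation choice among several.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def estimate_zipf_exponent(frequencies):
    """Estimate s from frequencies sorted in descending order.

    Fits log f = log C - s * log r by ordinary least squares and
    returns the negated slope. Needs at least two positive frequencies.
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy input, for illustration only; real estimates need a large corpus.
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
s_hat = estimate_zipf_exponent([count for _, _, count in table])
print(table[:3], round(s_hat, 3))
```

On a sample this small the estimate is not meaningful; the point is only to show how the rank-frequency distribution and the exponent relate.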
Zipf’s Law
Zipf’s law is an empirical law described by George Zipf in the 1930s. It relates the frequency of each word in a language corpus to its rank in a frequency-sorted list. In this article, we will explore the concept of Zipf’s law and its applications in natural language processing.
Table of Contents
- What is Zipf’s Law?
- Mathematical Formulation
- Example of Zipf’s Law
- Python Implementation of Zipf’s Law
- Applications
- Deviation from Zipf’s Law