Zipf's Law

Zipf, Statistics, Word Frequency

I was reading a paper on Bloom filters in which the authors mention the Zipfian distribution. The Zipfian distribution, or Zipf's law, describes an inverse relationship between rank and frequency: when items are ranked from most to least frequent, an item's frequency is roughly inversely proportional to its rank. It looks quite different from the familiar bell-shaped normal distribution. As a result, multiplying the rank and frequency of any point on the distribution should give roughly the same constant, regardless of which point we choose.
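To make the rank times frequency idea concrete, here is a minimal Python sketch. The function name `rank_frequency_products` and the counts are made up for illustration: the counts are synthetic, idealized values chosen to follow frequency = 1200 / rank exactly, not measurements from any corpus. Real data will only be approximately constant.

```python
def rank_frequency_products(frequencies):
    """Return rank * frequency for each item, ranked from most to least frequent."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]


# Synthetic, idealized counts (not real data): frequency = 1200 / rank.
ideal_counts = [1200 // rank for rank in range(1, 7)]  # 1200, 600, 400, 300, 240, 200

products = rank_frequency_products(ideal_counts)
print(products)  # every product is 1200 for this idealized input
```

With a real word list the products drift around a constant rather than matching it exactly, especially for the very top ranks and the long tail.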

This distribution shows up in word frequencies across many natural languages. One suggested explanation is a tradeoff between short, one-syllable words, which are easier for humans to speak, and the chance of miscommunication.

Following this video, I took the text of The Count of Monte Cristo, ran a word frequency analysis on it with AntConc, uploaded the results to Google Sheets, and created the chart below.

[Chart: Zipf distribution]

Notice how the word frequencies in the corpus follow the same Zipfian distribution.
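For anyone who would rather script the analysis, a rough equivalent can be sketched in Python. This is not what I did (I used AntConc and Google Sheets); it is a simplified stand-in. The helper name `rank_table` and the sample sentence are hypothetical, and the tokenizer is a crude assumption: it lowercases the text and keeps only runs of the letters a to z and apostrophes, so accented characters would split words.

```python
import re
from collections import Counter


def rank_table(text, top_n=10):
    """Return (rank, word, count, rank * count) rows for the top_n most common words."""
    # Crude tokenizer (an assumption): lowercase, keep runs of a-z and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts.most_common(top_n), start=1)
    ]


# Tiny made-up sample, only to show the shape of the output.
sample = "the cat and the dog and the bird"
for row in rank_table(sample):
    print(row)
```

To try it on a full book, read the file into a string and pass it to `rank_table`. A sample this small is far too short to show Zipf's law; the last column only starts to settle near a constant on a large corpus.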

