 

How to find out the entropy of the English language

Tags:

nlp

entropy

How to find out the entropy of the English language by using isolated symbol probabilities of the language?

Asked Mar 07 '12 by wcwchamara


People also ask

What is the entropy over words in English?

Shannon estimates the entropy of the set of words in printed English as 11.82 bits per word. As this figure seems inconsistent with some results deduced from several encoding procedures, the entropy was recalculated and found to be roughly 9.8 bits per word.

How do you find the entropy of the alphabet?

The maximum possible entropy of an alphabet consisting of N different letters is H = log2 N. This is only achieved if the probability of every letter is 1/N; in that case 1/N is the probability of both the "most likely" and the "least likely" letter.
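
(For English text restricted to the 26 lowercase letters plus the space character, i.e. 27 equiprobable symbols, this upper bound works out to log2 27 ≈ 4.75 bits per character.)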

What is entropy in language?

The amount of information carried in the arrangement of words is the same across all languages, even languages that aren't related to each other. This consistency could hint at a single common ancestral language, or universal features of how human brains process speech.


1 Answer

If we define 'isolated symbol probabilities' in the way it's done in this SO answer, we would have to do the following:

  1. Obtain a representative sample of English text (perhaps a carefully selected corpus of news articles, blog posts, some scientific articles and some personal letters), as large as possible

  2. Iterate through its characters and count the frequency of occurrence of each of them

  3. Use the frequency, divided by the total number of characters, as an estimate of each character's probability

  4. Calculate the average length in bits of each character by multiplying its probability by the negative logarithm of that same probability (the base-2 logarithm if we want the unit of entropy to be the bit)

  5. Take the sum of these contributions over all characters. That sum is the entropy estimate (written out as a formula below).
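
Written out, steps 3–5 compute the standard isolated-symbol entropy, where p(c) denotes the estimated probability of character c and the sum runs over all distinct characters:

H = - Σ p(c) log2 p(c)

The result is in bits per character because the logarithm is taken to base 2.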

Caveats:

  • This isolated-symbols entropy is not what is usually referred to as Shannon's entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities rather than on isolated symbols, and his famous 1951 paper ("Prediction and Entropy of Printed English") is largely about how to determine the optimal n.

  • Most people who try to estimate the entropy of English exclude punctuation characters and normalise all text to lowercase.

  • The above assumes that a symbol is defined as a character (or letter) of English. You could do the same thing for entire words, or for other units of text (a word-level sketch follows the code example below).

Code example:

Here is some Python code that implements the procedure described above. It normalises the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace character. It assumes that you have put together a representative corpus of English and provide it (encoded as ASCII) on STDIN.

import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text: replace any run of characters outside a-z
# with a single space, so the space character itself is counted as a symbol.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ',text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
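
To try it out, feed a plain-text corpus to the script on STDIN, for example (entropy.py and corpus.txt are placeholder names):

python entropy.py < corpus.txt

As mentioned in the caveats, the same procedure works for entire words instead of characters. Here is a minimal sketch of that variant, assuming the same kind of plain-text input on STDIN and taking whitespace-delimited, lowercased tokens as words:

import sys
from collections import Counter
from math import log

# Split the lowercased input into whitespace-delimited words
words = sys.stdin.read().lower().split()

# Count how often each distinct word occurs
word_frequency = Counter(words)

# Sum -p * log2(p) over all distinct words
entropy = 0.0
for count in word_frequency.values():
    probability = float(count) / len(words)
    entropy -= probability * (log(probability) / log(2))

sys.stdout.write('Entropy: %f bits per word\n' % entropy)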
Answered by jogojapan