How can the entropy of the English language be estimated using isolated symbol probabilities?
Shannon estimated the entropy of the set of words in printed English as 11.82 bits per word. Because this figure seems inconsistent with results deduced from several encoding procedures, the entropy was later recalculated and found to be roughly 9.8 bits per word.
Solution: (a) The maximum possible entropy of an alphabet consisting of N different letters is H = log2 N. This is only achieved if the probability of every letter is 1/N. Thus 1/N is the probability of both the “most likely” and the “least likely” letter.
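As a quick sanity check of this upper bound (a minimal sketch, not part of the quoted solution): for the 26 lowercase letters the maximum is log2(26) ≈ 4.70 bits per letter, and if the space is included as a 27th symbol it is log2(27) ≈ 4.75 bits per letter.
from math import log2
# Maximum possible per-symbol entropy for a uniform alphabet of N symbols.
for n in (26, 27):  # 26 letters, or 26 letters plus the space character
    print('N = %d: maximum entropy = %.2f bits per symbol' % (n, log2(n)))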
The amount of information carried in the arrangement of words appears to be roughly the same across languages, even languages that are not related to each other. This consistency could hint at a single common ancestral language, or at universal features of how human brains process speech.
If we define 'isolated symbol probabilities' in the way it's done in this SO answer, we would have to do the following:
Obtain a representative sample of English text (perhaps a carefully selected corpus of news articles, blog posts, some scientific articles and some personal letters), as large as possible
Iterate through its characters and count the frequency of occurrence of each of them
Use the frequency, divided by the total number of characters, as estimate for each character's probability
Calculate each character's contribution, i.e. its average code length in bits, by multiplying its probability by the negative logarithm of that same probability (the base-2 logarithm if we want the unit of entropy to be bits)
Take the sum of these contributions over all characters. That sum is the entropy estimate (a tiny worked example follows).
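To make the arithmetic concrete, consider a toy, made-up text consisting only of the three characters 'aab': the probabilities are p(a) = 2/3 and p(b) = 1/3, so the entropy is -(2/3)*log2(2/3) - (1/3)*log2(1/3) ≈ 0.918 bits per character. The same calculation in Python (a minimal sketch, not tied to any particular corpus):
from math import log2
text = 'aab'  # toy text, only to illustrate the arithmetic
probabilities = [text.count(c) / len(text) for c in set(text)]
entropy = -sum(p * log2(p) for p in probabilities)
print('%.3f bits per character' % entropy)  # prints 0.918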
Caveats:
This isolated-symbols entropy is not what is usually referred to as Shannon's entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities (and on human prediction experiments) rather than isolated symbols, and his famous 1951 paper, "Prediction and Entropy of Printed English", studies how the estimate drops as n grows (see the sketch after this list).
Most people who try to estimate the entropy of English exclude punctuation characters and normalise all text to lowercase.
The above assumes that a symbol is defined as a character (or letter) of English. You could do a similar thing for entire words, or other units of text.
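To illustrate the first caveat, here is a minimal sketch of a conditional bigram estimate H(X_n | X_(n-1)), i.e. the average uncertainty about a character given the character immediately before it. The function name and the way the text is supplied are chosen here for illustration; this plug-in estimate is not Shannon's actual procedure, which also relied on human prediction experiments.
from collections import Counter
from math import log2

def bigram_conditional_entropy(text):
    # Plug-in estimate of H(X_n | X_(n-1)) from bigram and unigram counts.
    unigrams = Counter(text[:-1])            # counts of the conditioning character
    bigrams = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    total = sum(bigrams.values())
    entropy = 0.0
    for (first, second), count in bigrams.items():
        p_pair = count / total               # P(x_(n-1), x_n)
        p_cond = count / unigrams[first]     # P(x_n | x_(n-1))
        entropy -= p_pair * log2(p_cond)
    return entropy

# Hypothetical usage with an already cleaned, lowercased text:
# print('%.3f bits per character' % bigram_conditional_entropy(cleaned_text))
For English, this conditional estimate comes out noticeably lower than the isolated-symbol figure, which is exactly why the two should not be confused.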
Code example:
Here is some Python code that implements the procedure described above. It normalises the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace character. It assumes that you have put together a representative corpus of English and provide it (encoded as ASCII) on STDIN.
import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ', text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
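If you save this as, say, entropy.py (a file name chosen here for illustration), you would run it as python entropy.py < corpus.txt. For a reasonably large sample of lowercased English with spaces included, the isolated-symbol estimate typically comes out at roughly 4 bits per character, well below the log2(27) ≈ 4.75 upper bound of the uniform case.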