 

How to find out the entropy of the English language

Tags:

nlp

entropy

How to find out the entropy of the English language by using isolated symbol probabilities of the language?

Asked Mar 07 '12 by wcwchamara


People also ask

What is the entropy over words in English?

Shannon estimates the entropy of the set of words in printed English as 11.82 bits per word. As this figure seems inconsistent with some results deduced from several encoding procedures, the entropy was recalculated and found to be roughly 9.8 bits per word.

How do you find the entropy of the alphabet?

The maximum possible entropy of an alphabet consisting of N different letters is H = log2 N. This is only achieved if the probability of every letter is 1/N; in that case 1/N is the probability of both the "most likely" and the "least likely" letter.
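
(For English text restricted to the 26 lowercase letters plus the space character, i.e. 27 equiprobable symbols, this upper bound works out to log2 27 ≈ 4.75 bits per character.)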

What is entropy in language?

The amount of information carried in the arrangement of words is the same across all languages, even languages that aren't related to each other. This consistency could hint at a single common ancestral language, or universal features of how human brains process speech.


1 Answer

If we define 'isolated symbol probabilities' in the way it's done in this SO answer, we would have to do the following:

  1. Obtain a representative sample of English text (perhaps a carefully selected corpus of news articles, blog posts, some scientific articles and some personal letters), as large as possible

  2. Iterate through its characters and count the frequency of occurrence of each of them

  3. Use the frequency, divided by the total number of characters, as an estimate of each character's probability

  4. Calculate the average length in bits of each character by multiplying its probability by the negative logarithm of that same probability (the base-2 logarithm if we want the unit of entropy to be the bit)

  5. Take the sum of these contributions over all characters. That sum is the entropy estimate (written out as a formula below).
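
Written out, steps 3–5 compute the standard isolated-symbol entropy, where p(c) denotes the estimated probability of character c and the sum runs over all distinct characters:

H = - Σ p(c) log2 p(c)

The result is in bits per character because the logarithm is taken to base 2.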

Caveats:

  • This isolated-symbols entropy is not what is usually referred to as Shannon's entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities rather than on isolated symbols, and his famous 1951 paper ("Prediction and Entropy of Printed English") is largely about how to determine the optimal n.

  • Most people who try to estimate the entropy of English exclude punctuation characters and normalise all text to lowercase.

  • The above assumes that a symbol is defined as a character (or letter) of English. You could do the same thing for entire words, or for other units of text (a word-level sketch follows the code example below).

Code example:

Here is some Python code that implements the procedure described above. It normalises the text to lowercase and excludes punctuation and any other non-alphabetic, non-whitespace character. It assumes that you have put together a representative corpus of English and provide it (encoded as ASCII) on STDIN.

import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text: replace any run of characters outside a-z
# with a single space, so the space character itself is counted as a symbol.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ',text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
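
To try it out, feed a plain-text corpus to the script on STDIN, for example (entropy.py and corpus.txt are placeholder names):

python entropy.py < corpus.txt

As mentioned in the caveats, the same procedure works for entire words instead of characters. Here is a minimal sketch of that variant, assuming the same kind of plain-text input on STDIN and taking whitespace-delimited, lowercased tokens as words:

import sys
from collections import Counter
from math import log

# Split the lowercased input into whitespace-delimited words
words = sys.stdin.read().lower().split()

# Count how often each distinct word occurs
word_frequency = Counter(words)

# Sum -p * log2(p) over all distinct words
entropy = 0.0
for count in word_frequency.values():
    probability = float(count) / len(words)
    entropy -= probability * (log(probability) / log(2))

sys.stdout.write('Entropy: %f bits per word\n' % entropy)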
Answered by jogojapan