
Is there an algorithm to find the Shannon entropy for text?

Tags:

text

algorithm

I know the Shannon entropy for English is estimated at 1.0 to 1.5 bits per letter (some estimates go as low as 0.6 to 1.3 bits per letter), but I was wondering: is there an algorithm that can look at a large volume of text and determine the expected value for that collective text, say 0.08 bits per letter?

Polo Montana asked Apr 08 '12 21:04


People also ask

How do you find the entropy of a text file?

To compute entropy, the frequency of occurrence of each character must first be counted. The probability of each character is then its frequency divided by the length of the string.
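A minimal sketch of that frequency count in Python (the function name is my own, not from the answer):

```python
from collections import Counter

def char_probabilities(text):
    """Count each character's occurrences, then divide by the string length
    to get an empirical probability for each character."""
    counts = Counter(text)
    n = len(text)
    return {ch: c / n for ch, c in counts.items()}

# In "hello", 'l' occurs 2 times out of 5 characters, so P('l') = 0.4.
probs = char_probabilities("hello")
```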

How do you find the Shannon entropy of a string?

The Shannon entropy of a string is simply the negative of the sum of the products of all the p_i * log(p_i) terms, where p_i is the proportion of the string made up of the i'th character and log() is the logarithm to base-2.
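That formula, sketched directly in Python (the helper name is illustrative):

```python
import math
from collections import Counter

def shannon_entropy(s):
    """H(s) = -sum over distinct characters of p_i * log2(p_i),
    where p_i is the proportion of s made up of the i-th character."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
```

For example, a string with two equally frequent characters has an entropy of exactly 1 bit per character, and a string of one repeated character has an entropy of 0.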

What is entropy of text?

The entropy of a language is a statistical parameter which measures, in a certain sense, how much information is produced on average for each letter of a text in that language. When compressing a text, the letters of the text must be translated into the binary digits 0 or 1.


1 Answer

The mathematical definition of the entropy rate of a language is this: given a source that generates strings in that language, it is the limit, as n grows, of the entropy of the nth symbol conditioned on the n-1 previous ones (assuming the source is stationary).
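Written out (this is the standard formulation, not quoted from the answer itself), the entropy rate is the limit of the conditional entropies:

```latex
H(L) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)
```

For a stationary source this limit exists and equals $\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$, the per-symbol entropy of long blocks.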

A good enough approximation of such a source is a large corpus of English text. The Open American National Corpus is pretty nice (100M characters, covering all types of written text). The basic algorithm to approximate the limit above is then, for a given n, to look at all n-grams that appear in the text and build statistical estimates of the probabilities that occur in the definition of the conditional entropies involved in calculating the entropy rate.
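A sketch of that n-gram estimate, assuming plug-in (maximum-likelihood) probabilities taken straight from corpus counts; the function name is my own, and this is not the answer's full implementation:

```python
import math
from collections import Counter

def conditional_entropy(text, n):
    """Estimate H(X_n | X_1..X_{n-1}) from n-gram and (n-1)-gram counts.

    Both counters are built over the same sliding windows, so the context
    count of g[:-1] is the number of n-grams starting with that context."""
    windows = range(len(text) - n + 1)
    ngrams = Counter(text[i:i + n] for i in windows)
    contexts = Counter(text[i:i + n - 1] for i in windows)
    total = sum(ngrams.values())
    h = 0.0
    for g, c in ngrams.items():
        p_joint = c / total               # P(context, symbol)
        p_cond = c / contexts[g[:-1]]     # P(symbol | context)
        h -= p_joint * math.log2(p_cond)
    return h
```

With n = 1 this reduces to the plain character entropy; as n grows (and given enough text), the estimate approaches the entropy rate, although for large n the counts become sparse and the estimate biased, which is why a large corpus matters.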

The full source code to do this is short and simple (~40 lines of Python). I recently wrote a blog post about estimating the entropy rate of English that goes into much more detail, including the mathematical definitions and a full implementation. It also includes references to various relevant papers, including Shannon's original article.

Clément answered Nov 16 '22 00:11