I know the Shannon entropy of English is estimated at 1.0 to 1.5 bits per letter, and some put it as low as 0.6 to 1.3 bits per letter, but I was wondering: is there an algorithm that can look at a large volume of text and determine the entropy of that collective text, say that it comes out at 0.8 bits per letter?
To compute the entropy, the frequency of occurrence of each character must first be counted. The probability of each character is then obtained by dividing its frequency by the length of the string.
The Shannon entropy of a string is simply the negative of the sum of all the p_i * log(p_i) terms, where p_i is the proportion of the string made up of the i-th character and log() is the base-2 logarithm.
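A minimal sketch of that calculation in Python (character counts divided by the string length, then the -p * log(p) sum; the example string is only illustrative):

    import math
    from collections import Counter

    def shannon_entropy(text):
        """Shannon entropy of a string, in bits per character."""
        counts = Counter(text)   # frequency of each character
        length = len(text)
        # p_i = count_i / length; entropy = -sum(p_i * log2(p_i))
        return -sum((c / length) * math.log2(c / length) for c in counts.values())

    print(shannon_entropy("hello world"))  # roughly 2.85 bits per character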
The entropy of a language is a statistical parameter which measures, in a certain sense, how much information is produced on average for each letter of a text in that language. If the text is compressed as efficiently as possible, the entropy is the average number of binary digits (0 or 1) needed per letter.
The mathematical definition of the entropy rate of a language is this: given a source that generates strings in that language, it is the limit, as n grows, of the entropy of the nth symbol conditioned on the n-1 previous ones (assuming the source is stationary).
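Written out (a standard formulation, assuming a stationary source emitting symbols X_1, X_2, ...):

    H = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)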
A good enough approximation of such a source is a large corpus of English text. The Open American National Corpus is pretty nice (about 100M characters, covering all types of written text). The basic algorithm to approximate the limit above is then, for a given n, to look at all the n-grams that appear in the text and build statistical estimates of the probabilities that occur in the definition of the conditional entropy involved in computing the entropy rate.
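A rough sketch of that estimate, assuming the corpus has been concatenated into a single plain-text file (corpus.txt is just a placeholder name); the conditional entropy H(X_n | X_1..X_{n-1}) is taken as the difference between the n-gram and (n-1)-gram block entropies:

    import math
    from collections import Counter

    def block_entropy(text, k):
        """Entropy (in bits) of the distribution of k-grams in the text."""
        if k == 0:
            return 0.0
        counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def conditional_entropy(text, n):
        """Estimate H(X_n | X_1..X_{n-1}) as H(n-grams) - H((n-1)-grams)."""
        return block_entropy(text, n) - block_entropy(text, n - 1)

    # corpus.txt is a placeholder for whatever large English text file you use
    with open("corpus.txt") as f:
        corpus = f.read()

    for n in range(1, 6):
        print(n, conditional_entropy(corpus, n))  # decreases toward the entropy rate

The values typically drop quickly for the first few n and then flatten out; where they level off is the estimate of the entropy rate (limited, for large n, by how much data the corpus provides).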
The full source code to do it is short and simple (~40 lines of Python). I recently wrote a blog post about estimating the entropy rate of English that goes into much more detail, including the mathematical definitions and a full implementation. It also includes references to various relevant papers, including Shannon's original article.