
Is there an algorithm to find the Shannon entropy for text?

Tags:

text

algorithm

I know the Shannon entropy for English is estimated at 1.0 to 1.5 bits per letter (some estimates go as low as 0.6 to 1.3 bits per letter), but I was wondering: is there an algorithm that can look at a large volume of text and determine the expected value for that collective text, say 0.08 bits per letter?

Polo Montana asked Apr 08 '12 21:04


People also ask

How do you find the entropy of a text file?

To compute entropy, the frequency of occurrence of each character must first be counted. The probability of each character is then its frequency divided by the length of the string.
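A minimal sketch of that frequency count in Python (the function name is my own, not from the answer):

```python
from collections import Counter

def char_probabilities(text):
    """Count each character's occurrences, then divide by the string length
    to get an empirical probability for each character."""
    counts = Counter(text)
    n = len(text)
    return {ch: c / n for ch, c in counts.items()}

# In "hello", 'l' occurs 2 times out of 5 characters, so P('l') = 0.4.
probs = char_probabilities("hello")
```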

How do you find the Shannon entropy of a string?

The Shannon entropy of a string is simply the negative of the sum of the products of all the p_i * log(p_i) terms, where p_i is the proportion of the string made up of the i'th character and log() is the logarithm to base-2.
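That formula, sketched directly in Python (the helper name is illustrative):

```python
import math
from collections import Counter

def shannon_entropy(s):
    """H(s) = -sum over distinct characters of p_i * log2(p_i),
    where p_i is the proportion of s made up of the i-th character."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
```

For example, a string with two equally frequent characters has an entropy of exactly 1 bit per character, and a string of one repeated character has an entropy of 0.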

What is entropy of text?

The entropy of a language is a statistical parameter which measures, in a certain sense, how much information is produced on average for each letter of a text in that language. When compressing a text, the letters of the text must be translated into the binary digits 0 or 1.


1 Answer

The mathematical definition of the entropy rate of a language is this: given a source that generates strings in that language, it is the limit, as n grows, of the entropy of the nth symbol conditioned on the n-1 previous ones (assuming the source is stationary).
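Written out (this is the standard formulation, not quoted from the answer itself), the entropy rate is the limit of the conditional entropies:

```latex
H(L) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)
```

For a stationary source this limit exists and equals $\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$, the per-symbol entropy of long blocks.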

A good enough approximation of such a source is a large corpus of English text. The Open American National Corpus is pretty nice (100M characters, covering all types of written text). The basic algorithm to approximate the limit above is then, for a given n, to look at all n-grams that appear in the text and build statistical estimates of the probabilities that occur in the definition of the conditional entropies involved in calculating the entropy rate.
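A sketch of that n-gram estimate, assuming plug-in (maximum-likelihood) probabilities taken straight from corpus counts; the function name is my own, and this is not the answer's full implementation:

```python
import math
from collections import Counter

def conditional_entropy(text, n):
    """Estimate H(X_n | X_1..X_{n-1}) from n-gram and (n-1)-gram counts.

    Both counters are built over the same sliding windows, so the context
    count of g[:-1] is the number of n-grams starting with that context."""
    windows = range(len(text) - n + 1)
    ngrams = Counter(text[i:i + n] for i in windows)
    contexts = Counter(text[i:i + n - 1] for i in windows)
    total = sum(ngrams.values())
    h = 0.0
    for g, c in ngrams.items():
        p_joint = c / total               # P(context, symbol)
        p_cond = c / contexts[g[:-1]]     # P(symbol | context)
        h -= p_joint * math.log2(p_cond)
    return h
```

With n = 1 this reduces to the plain character entropy; as n grows (and given enough text), the estimate approaches the entropy rate, although for large n the counts become sparse and the estimate biased, which is why a large corpus matters.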

The full source code to do this is short and simple (~40 lines of Python). I recently wrote a blog post about estimating the entropy rate of English that goes into much more detail, including the mathematical definitions and a full implementation. It also includes references to various relevant papers, including Shannon's original article.

Clément answered Nov 16 '22 00:11