
How does the entropy of a string of English text signify low quality?

Jeff Atwood recently tweeted a link to a CodeReview post where he wanted to know if the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."

The gist of his method seemed to be that if you count the number of unique characters in a string, that signifies entropy (code taken from PieterG's answer):

int uniqueCharacterCount = input.Distinct().Count(); // needs System.Linq; "input" stands for the string being scored

I don't understand how the unique character count signifies entropy of a string, and how the entropy of a string signifies low quality. I was wondering if someone with more knowledge in this area could explain what Mr. Atwood is trying to accomplish.

Thanks!

Pandincus asked Feb 22 '11 16:02





2 Answers

The confusion seems to come from the idea that this is used to block posts from being submitted. It isn't.

It is just one of several algorithms used to find possible low-quality posts, displayed on the low quality posts tab (requires 10k rep) of the moderator tools. Actual humans still need to look at the post.

The idea is to catch posts like ~~~~~~No.~~~~~~ or FUUUUUUUU------, not to catch all low-quality posts.
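As a rough illustration of why such posts stand out, here is a small Python sketch (Python rather than the C# of the question, and not Stack Overflow's actual code). The helper name `distinct_char_count` and the sample sentence are made up for this example; it only counts distinct characters in the two posts above and in one ordinary sentence:

```python
def distinct_char_count(text: str) -> int:
    """Number of different characters in text (hypothetical helper)."""
    return len(set(text))


junk_posts = ["~~~~~~No.~~~~~~", "FUUUUUUUU------"]
ordinary = "How do I parse JSON in C#?"  # made-up sample, not from the original post

for post in junk_posts + [ordinary]:
    # Print the distinct-character count next to the total length.
    print(f"{distinct_char_count(post):2d} distinct in {len(post):2d} chars: {post}")
```

The two junk posts use only 4 and 3 different characters across 15 characters each, while the ordinary sentence uses far more.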


As for "How does the unique character-count signify entropy?" - it doesn't, really. The most upvoted answers completely miss the point.
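To make that concrete: Shannon entropy depends on how often each character occurs, not only on how many different characters there are. Below is a minimal Python sketch, assuming entropy is estimated from the string's own character frequencies. The function name and test strings are illustrative and are not taken from the linked answers.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Estimated bits per character, using the character frequencies of text itself."""
    if not text:
        return 0.0
    n = len(text)
    # Sum of p * log2(1/p) over each distinct character, with p = count / n.
    return sum((count / n) * math.log2(n / count) for count in Counter(text).values())


skewed = "aaaaaaab"  # two distinct characters, very uneven
even = "aaaabbbb"    # two distinct characters, evenly split

for s in (skewed, even):
    print(f"{s}: distinct={len(set(s))}, entropy={shannon_entropy(s):.3f} bits/char")
```

Both strings have two distinct characters, yet the evenly split one reaches 1 bit per character while the skewed one stays well below that. The distinct count is therefore at best a crude proxy for entropy.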

See https://codereview.stackexchange.com/questions/868#878 and https://codereview.stackexchange.com/questions/868#926

BlueRaja - Danny Pflughoeft answered Sep 24 '22 16:09


String 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' has very low entropy, and is rather meaningless.

String 'blah blah blah blah blah blah blah blah' has somewhat higher entropy, but is still rather silly and can be part of an attack.

A post or comment whose entropy is comparable to these strings is probably not appropriate: it is unlikely to carry any meaningful message, or even a spam link. Such a post can simply be filtered out, or it can trigger an additional captcha.
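As a worked example, the per-character Shannon entropy of the two strings above can be computed by hand. This assumes the second string is exactly eight repetitions of 'blah' separated by single spaces (39 characters), and that probabilities are estimated from each string's own character counts:

```latex
H(s) = -\sum_{c} p_c \log_2 p_c, \qquad p_c = \frac{\text{count of } c \text{ in } s}{|s|}

% 'aaaa...a': a single symbol with p_a = 1
H = -1 \cdot \log_2 1 = 0 \text{ bits per character}

% eight 'blah' plus seven spaces: |s| = 39, p_b = p_l = p_a = p_h = 8/39, p_space = 7/39
H = -4 \cdot \tfrac{8}{39}\log_2\tfrac{8}{39} - \tfrac{7}{39}\log_2\tfrac{7}{39} \approx 2.32 \text{ bits per character}
```

The second value is close to the maximum for five symbols, log2 5 ≈ 2.32, because the five characters occur almost equally often. It is still low in absolute terms because so few different characters appear.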

9000 answered Sep 24 '22 16:09