Scoring a string based on how English-like it is

I'm not sure how exactly to word this question, so here's an example:

    string1 = "THEQUICKBROWNFOX"
    string2 = "KLJHQKJBKJBHJBJLSDFD"

I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.

In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the reverend.thomas.Bayes() Python library and trained it with some all-caps strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...

Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.

Thanks for the help!

asked Jul 29 '11 by Seth

2 Answers

I would start with a simple probability model for how likely each letter is, given the previous letter (possibly null at the start of a word). You could build this from a dictionary file, and later expand it to use the 2 or 3 previous letters as context if the initial model is not good enough. Then multiply all the probabilities to get a score for the string, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare strings of different lengths.
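
Here is a minimal sketch of that idea in Python. It assumes a plain word-list file such as /usr/share/dict/words; that path and the 1e-6 floor for unseen letter pairs are my own choices, not part of the answer.

    import math
    from collections import defaultdict

    def train_letter_model(words):
        # Count transitions from the previous letter to the next one;
        # None stands in for the start-of-word context.
        counts = defaultdict(lambda: defaultdict(int))
        for word in words:
            word = word.strip().upper()
            if not word.isalpha():
                continue
            prev = None
            for ch in word:
                counts[prev][ch] += 1
                prev = ch
        # Turn the raw counts into conditional probabilities.
        model = {}
        for prev, nxt in counts.items():
            total = sum(nxt.values())
            model[prev] = {ch: n / total for ch, n in nxt.items()}
        return model

    def score(text, model, floor=1e-6):
        # Multiply the per-letter probabilities (in log space to avoid
        # underflow) and take the Nth root to normalize for length.
        text = text.upper()
        logp = 0.0
        prev = None
        for ch in text:
            logp += math.log(model.get(prev, {}).get(ch, floor))
            prev = ch
        return math.exp(logp / len(text))

    with open("/usr/share/dict/words") as f:
        model = train_letter_model(f)

    print(score("THEQUICKBROWNFOX", model))      # scores noticeably higher
    print(score("KLJHQKJBKJBHJBJLSDFD", model))  # scores noticeably lower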

answered Oct 03 '22 by R.. GitHub STOP HELPING ICE


I don't see why a Markov chain couldn't be modified to work. I would create a text-file dictionary of sorts and read it in to populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would pull out the probability of the next letter. For instance, if the current chain were "TH" and the next letter were "E", you would go to your map and look up the probability that an "E" follows "TH". Personally I would simply add up all of these probabilities while looping through the string, but exactly how to turn the probabilities into a score is up to you. You could normalize it by string length, to let you compare short and long strings.
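
A rough sketch of that approach, using a chain of two letters to predict the third and averaging the transition probabilities over the string as suggested above. The huckfinn.txt training file is just a placeholder for whatever corpus or dictionary you use.

    from collections import defaultdict

    def build_chain(training_text, order=2):
        # Count how often each letter follows each `order`-letter context.
        counts = defaultdict(lambda: defaultdict(int))
        text = "".join(ch for ch in training_text.upper() if ch.isalpha())
        for i in range(len(text) - order):
            counts[text[i:i + order]][text[i + order]] += 1
        # Normalize each context's counts into probabilities.
        chain = {}
        for context, nxt in counts.items():
            total = sum(nxt.values())
            chain[context] = {ch: n / total for ch, n in nxt.items()}
        return chain

    def score(s, chain, order=2):
        # Sum the probability of each observed transition, then divide by
        # the number of transitions so string length doesn't dominate.
        s = s.upper()
        total, transitions = 0.0, 0
        for i in range(len(s) - order):
            total += chain.get(s[i:i + order], {}).get(s[i + order], 0.0)
            transitions += 1
        return total / transitions if transitions else 0.0

    with open("huckfinn.txt") as f:   # placeholder training corpus
        chain = build_chain(f.read())

    print(score("THEQUICKBROWNFOX", chain))
    print(score("KLJHQKJBKJBHJBJLSDFD", chain))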

Now that I think about it, this method would favor strings containing longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary not only with single words, but also with short phrases with the spaces removed. Then the scoring would reflect not only how English the separate words are, but how English-like a series of words is. It's not a perfect system, but it would provide consistent scoring.
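
For example, the build_chain sketch above could be trained on space-stripped phrases rather than single words; the phrases here are only illustrative.

    # Reuses build_chain from the sketch above. Strip spaces so multi-word
    # phrases feed the same letter-level chain (build_chain already drops
    # non-letter characters, so raw sentences work too).
    phrases = ["THE QUICK BROWN FOX", "IT WAS THE BEST OF TIMES"]
    chain = build_chain("".join(p.replace(" ", "") for p in phrases))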

answered Oct 03 '22 by Godric Seer