In Python, how can I distinguish between a human-readable word and a random string?

Question

Examples of words:

ball
encyclopedia
tableau

Examples of random strings:

qxbogsac
jgaynj
rnnfdwpm

Of course, a random string may happen to be a word in some language, or look like one. But a human being can usually tell whether something looks 'random' or not, essentially by checking whether it can be pronounced.

I tried calculating entropy to distinguish the two, but it's far from perfect. Do you have any other ideas or algorithms that work?

There is one important requirement, though: I can't use heavyweight libraries like NLTK, and I can't use dictionaries. What I need is a simple, quick heuristic that works in most cases.
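For reference, the entropy measure mentioned above can be computed per character roughly as in the following minimal sketch (the function name `char_entropy` is only illustrative, not something from the question). It also hints at why entropy alone separates the two classes poorly: it reflects how often characters repeat, not whether their order is pronounceable.

```python
import math
from collections import Counter

def char_entropy(s):
    """Shannon entropy, in bits per character, of the characters in s."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Compare a few of the example words and random strings.
for s in ["ball", "tableau", "qxbogsac", "jgaynj"]:
    print(s, round(char_entropy(s), 3))
```

With this measure the real word tableau scores higher than the random string jgaynj, simply because it has more distinct characters, so no single threshold separates the examples.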

mhucka · Accepted Answer

I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted during source-code mining are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary".) The approach does not check pronunciation, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, perhaps it will be useful for either the OP or someone else looking to solve a similar problem.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
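To illustrate the general idea of scoring a string by n-gram frequencies, here is a minimal sketch using character bigrams. It is not Nostril's implementation: the tiny `SAMPLE_WORDS` list, the function names, and the add-one smoothing are all assumptions made for the example, and a usable table would have to be built from a much larger corpus.

```python
import math
from collections import Counter

# Hypothetical training sample; far too small for real use.
SAMPLE_WORDS = [
    "table", "call", "basket", "letter", "people", "random", "string",
    "python", "reading", "nation", "station", "better", "simple", "answer",
    "window", "garden", "market", "silver", "planet", "mother",
]

def bigrams(s):
    """Character bigrams of s, with ^ and $ marking start and end."""
    padded = "^" + s.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train_bigrams(words):
    """Count how often each bigram occurs in the training words."""
    counts = Counter()
    for w in words:
        counts.update(bigrams(w))
    return counts

def avg_log_prob(s, counts, alphabet_size=28):
    """Average log-probability per bigram, with add-one smoothing.

    alphabet_size assumes 26 letters plus the ^ and $ markers.
    """
    total = sum(counts.values())
    vocab = alphabet_size ** 2  # number of possible bigrams
    pairs = bigrams(s)
    return sum(math.log((counts[p] + 1) / (total + vocab)) for p in pairs) / len(pairs)

COUNTS = train_bigrams(SAMPLE_WORDS)
for s in ["ball", "qxbogsac", "jgaynj"]:
    print(s, round(avg_log_prob(s, COUNTS), 3))
```

Higher (less negative) scores mean the string is made of bigrams that are common in the training words. Where to put the cut-off between "real" and "nonsense" would have to be tuned on labelled examples.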

Abhijit · Answer

Caveat: I am not a natural language expert.

Assuming that what is described in the linked article, "If You Can Raed Tihs, You Msut Be Raelly Smrat", is authentic, a simple approach would be:

Have an English dictionary (I believe the approach is language agnostic).

Create a Python dict of the words, keyed by the first and last characters of each word in the dictionary:

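from collections import defaultdict

words = defaultdict(list)  # key: first + last character
with open("your_dict.txt") as fin:
    for line in fin:
        word = line.strip()  # drop the trailing newline
        if word:
            words[word[0] + word[-1]].append(word)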

Now, for any given word (the needle), look up the candidates that share its first and last characters:
```
candidates = words[needle[0] + needle[-1]]
```

Then check whether any candidate consists of exactly the same characters as your needle:

for match in words[needle[0] + needle[-1]]:
    if sorted(match) == sorted(needle):
        print("Human Readable Word")
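Put together, the steps above can be sketched as a self-contained example. The helper names `build_index` and `is_scrambled_word` are made up for illustration, and a short in-memory list stands in for your_dict.txt.

```python
from collections import defaultdict

def build_index(word_list):
    """Group words by their first and last characters."""
    index = defaultdict(list)
    for word in word_list:
        word = word.strip().lower()
        if word:
            index[word[0] + word[-1]].append(word)
    return index

def is_scrambled_word(needle, index):
    """True if needle has the same first letter, last letter, and
    multiset of letters as some known word."""
    needle = needle.lower()
    if not needle:
        return False
    key = needle[0] + needle[-1]
    letters = sorted(needle)
    # .get avoids adding empty entries to the defaultdict on a miss.
    return any(sorted(word) == letters for word in index.get(key, []))

# Hypothetical word list standing in for the dictionary file.
INDEX = build_index(["read", "this", "must", "really", "smart"])
for s in ["raed", "tihs", "smrat", "qxbogsac"]:
    print(s, is_scrambled_word(s, INDEX))
```

Note that this only recognises exact words and scrambles of them that keep the first and last letters in place; it says nothing about pronounceable strings that are not in the word list.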

A comparatively slower approach would be to use difflib.get_close_matches(word, possibilities[, n][, cutoff]).
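A minimal sketch of the difflib variant, assuming a small in-memory vocabulary; the wrapper name `closest_word` and the 0.8 cutoff are illustrative choices, not recommendations:

```python
from difflib import get_close_matches

def closest_word(s, vocabulary, cutoff=0.8):
    """Return the best match in vocabulary with similarity >= cutoff, or None."""
    matches = get_close_matches(s.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical vocabulary; in practice this would be the full word list.
VOCABULARY = ["ball", "encyclopedia", "tableau"]
for s in ["tableu", "qxbogsac"]:
    print(s, closest_word(s, VOCABULARY))
```

Because every call compares the input against each entry in the vocabulary, the cost grows with the size of the word list, which is why this is the slower option.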

Tags: python, string, random, nlp, heuristics

mnowotka
