
How can I tell whether a string is randomly generated or plausibly an English word?

I have a corpus of text that contains a number of strings. Some of them are English words and some are random, such as VmsVKmGMY6eQE4eMI. There is no limit on the number of characters in each string.

Is there any way to test whether a given string is an English word? I am looking for some kind of algorithm that does the job. This is in Java, and I would rather not add an extra dictionary.

ikel asked Feb 11 '14 23:02


3 Answers

If you mean some kind of rule of thumb that distinguishes an English word from random text, there is none. For reasonable accuracy you will need to query an external source, whether that is the Web, a dictionary, or a service.

If you only need to check whether a word exists, I would suggest WordNet. It is pretty simple to use, and there is a nice Java API for it called JWNL that makes querying the WordNet dictionary a breeze.
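To illustrate what an existence check boils down to, here is a minimal sketch in Python. It does not use WordNet or JWNL; it assumes you already have some word list (the four-word `SAMPLE_WORDS` set is a made-up stand-in, and the helper names are hypothetical), and it simply tests set membership.

```python
def load_words(lines):
    """Build a lowercase lookup set from an iterable of words, one per line."""
    return {line.strip().lower() for line in lines if line.strip()}

# Stand-in for a real word list (an assumption for this sketch). In practice
# you would pass an open file containing one word per line instead.
SAMPLE_WORDS = load_words(["apple", "Banana", "corpus", "string"])

def is_known_word(s, words):
    """Return True if s, ignoring case, is in the word set."""
    return s.lower() in words

for s in ["Corpus", "VmsVKmGMY6eQE4eMI"]:
    print("{}: {}".format(s, "known" if is_known_word(s, SAMPLE_WORDS) else "unknown"))
```

A lookup like this only recognizes strings that are in the list, so its accuracy depends entirely on how complete the list is.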

Warlord answered Oct 08 '22 22:10


I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not Java, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. Nostril does not use a dictionary, but it does incorporate a rather large table of n-gram frequencies to support its probabilistic assessment of text strings.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

The project is on GitHub and I welcome contributions. If you really need a Java implementation, perhaps we can make Nostril compatible with Python 2.7 and you can try to use Jython to run it from Java.

mhucka answered Oct 08 '22 22:10


If you want to differentiate things that are word-like, but possibly not popular enough to be in a dictionary, from gibberish or random text, it's not actually that hard. See my answer to the question "Is there any way to detect strings like putjbtghguhjjjanika?", which contains implementations in Python and PHP.
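As a rough illustration of how a dictionary-free check can work, here is a minimal sketch of a character-bigram model. It is not the code from the linked answer: the tiny `CORPUS`, the function names, and the choice of a uniform-random baseline as the cutoff are all assumptions made for this example. The idea is to learn how often each character follows another in English text, then score a string by the average log-probability of its character transitions.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Tiny illustrative corpus (an assumption); train on a large body of
# English text in practice.
CORPUS = """
the quick brown fox jumps over the lazy dog
this is a small sample of english text used to train the model
there are some strings in the text and some of them are words
the string is tested and the result is printed
"""

def normalize(text):
    """Lowercase, collapse whitespace to single spaces, keep letters and spaces."""
    collapsed = " ".join(text.lower().split())
    return [c for c in collapsed if c in ALPHABET]

def train(corpus):
    """Return log-probabilities for every character bigram, with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    chars = normalize(corpus)
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(n / total) for b, n in row.items()}
    return model

def score(s, model):
    """Average log-probability per bigram; higher means more word-like."""
    chars = normalize(s)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return float("-inf")
    return sum(model[a][b] for a, b in pairs) / len(pairs)

# Cutoff chosen for this sketch: the score a uniformly random model would give.
BASELINE = math.log(1 / len(ALPHABET))

def is_wordlike(s, model):
    """True if s scores better than uniformly random characters would."""
    return score(s, model) > BASELINE

MODEL = train(CORPUS)
for s in ["string", "the", "zxqjvk"]:
    print("{}: {}".format(s, "word-like" if is_wordlike(s, MODEL) else "gibberish"))
```

With this corpus, "string" and "the" come out as word-like and "zxqjvk" as gibberish. Note that `normalize` silently drops digits and punctuation, which is a simplification, and that a cutoff tuned on labeled examples of real and random strings would likely separate the two classes better than the uniform baseline used here.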

Rob Neuhaus answered Oct 08 '22 23:10