Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find possible English words in long random string?

I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:

jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf

What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?

What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting TRIE solutions; but most of them are dealing with the situation where you know the set of words in advance.

like image 243
todkwxrtvwmzonunswam Avatar asked Oct 12 '13 19:10

todkwxrtvwmzonunswam


2 Answers

I used this solution to find all words forwards and backwards from a corpus of 28,000 random characters in a dictionary of 100,000 words in .5 seconds. It runs in O(n) time. It takes a file called "words.txt" which is a dictionary that has words separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words but I'm sure you can find plenty of text file dictionaries online if not that one.

from random import choice
import string

dictionary = set(open('words.txt','r').read().lower().split())
max_len = max(map(len, dictionary)) #longest word in the set of words

text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
text += '-'+text[::-1] #append the reverse of the text to itself

words_found = set() #set of words found, starts empty
for i in xrange(len(text)): #for each possible starting position in the corpus
    chunk = text[i:i+max_len+1] #chunk that is the size of the longest word
    for j in xrange(1,len(chunk)+1): #loop to check each possible subchunk
        word = chunk[:j] #subchunk
        if word in dictionary: #constant time hash lookup if it's in dictionary
            words_found.add(word) #add to set of words

print words_found
like image 132
GrantS Avatar answered Oct 24 '22 07:10

GrantS


Here is a bisection/binary search that should be usefull.

def isaprefix(frag, wordlist, first, last):
    """
    Recursive binary search of wordlist for words that start with frag.

    assumes wordlist is a sorted list
    typically called with first = 0 and last = len(wordlist)

    first,last -->> integer
    returns bool
    """

    # base case - down to two elements
    if (last - first) < 2:
        # return False unless frag is a prefix
        # of either of the two remaining words
        return wordlist[first].startswith(frag) or wordlist[last].startswith(frag)

    #mid = (first + last)/2
    midword = wordlist[(first + last) / 2]

    # go ahead and return if you find one
    # a second base case?
    if midword.startswith(frag):
        return True

    #print word, ' - ', wordlist[mid], ' - ', wordlist[mid][:len(word)], ' - ', isprefix
    # start the tests
    # python does just fine comparing strings
    if frag < midword:
        # set the limits to the lower half
        # of the previous range searched and recurse
        return isaprefix(frag, wordlist, first, mid-1)

    # frag is > midword: set the limits to the upper half
    # of the previous range searched and recurse
    return isaprefix(frag, wordlist, mid+1, last)
like image 34
wwii Avatar answered Oct 24 '22 08:10

wwii