I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:
jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf
What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?
What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting TRIE solutions; but most of them are dealing with the situation where you know the set of words in advance.
I used this solution to find all words forwards and backwards from a corpus of 28,000 random characters in a dictionary of 100,000 words in .5 seconds. It runs in O(n) time. It takes a file called "words.txt" which is a dictionary that has words separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words
but I'm sure you can find plenty of text file dictionaries online if not that one.
from random import choice
import string
dictionary = set(open('words.txt','r').read().lower().split())
max_len = max(map(len, dictionary)) #longest word in the set of words
text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
text += '-'+text[::-1] #append the reverse of the text to itself
words_found = set() #set of words found, starts empty
for i in xrange(len(text)): #for each possible starting position in the corpus
chunk = text[i:i+max_len+1] #chunk that is the size of the longest word
for j in xrange(1,len(chunk)+1): #loop to check each possible subchunk
word = chunk[:j] #subchunk
if word in dictionary: #constant time hash lookup if it's in dictionary
words_found.add(word) #add to set of words
print words_found
Here is a bisection/binary search that should be usefull.
def isaprefix(frag, wordlist, first, last):
"""
Recursive binary search of wordlist for words that start with frag.
assumes wordlist is a sorted list
typically called with first = 0 and last = len(wordlist)
first,last -->> integer
returns bool
"""
# base case - down to two elements
if (last - first) < 2:
# return False unless frag is a prefix
# of either of the two remaining words
return wordlist[first].startswith(frag) or wordlist[last].startswith(frag)
#mid = (first + last)/2
midword = wordlist[(first + last) / 2]
# go ahead and return if you find one
# a second base case?
if midword.startswith(frag):
return True
#print word, ' - ', wordlist[mid], ' - ', wordlist[mid][:len(word)], ' - ', isprefix
# start the tests
# python does just fine comparing strings
if frag < midword:
# set the limits to the lower half
# of the previous range searched and recurse
return isaprefix(frag, wordlist, first, mid-1)
# frag is > midword: set the limits to the upper half
# of the previous range searched and recurse
return isaprefix(frag, wordlist, mid+1, last)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With