Ultimately, I want to find out which word in the English dictionary contains the most subwords that are at least three letters long. I wrote this algorithm, but it's too slow to be useful. I'm wondering how I can optimize it.
def subWords(word):
    return set((word[0:i] for i in range(2, len(word)+1)))  # returns all subWords of length 2 or greater

def checkDict(wordList, dictList):
    return set((word for word in wordList if word in dictList))

def main():
    dictList = [i.strip() for i in open('wordlist.txt').readlines()]
    allwords = list()
    maximum = (0, list())
    for dictWords in dictList:
        for i in range(len(dictWords)):
            for a in checkDict(subWords(dictWords[i: len(dictWords) + 1]), dictList):
                allwords.append(a)
        if len(allwords) > maximum[0]:
            maximum = (len(allwords), allwords)
            print maximum
        allwords = list()
    print maximum

main()
The major weakness of your algorithm is that, for every subword, you need to compare it to every other word in the dictionary. You don't need to do that, really - if your word begins with 'a', you don't really need to see if it matches words that begin with 'b'. If the next letter is 'c', then you don't really care to compare it to words that begin with 'd'. The question then becomes: "How do I implement this idea efficiently?"
To do this, we can create a tree to represent all of the words in the dictionary. We construct this tree by taking each word in the dictionary and extending the tree with it, and shading in the last node.
When we want to test if a subword is in this tree, we just go through that word letter by letter and use those letters to determine where to go next in the tree (starting at the top). If we find that we have nowhere to go, or that we land on a non-shaded tree node after going through the whole subword, then it's not a word. Otherwise, it is a word if we land on a shaded node. The effect of this is that we can search the entire dictionary at once, rather than one word at a time. The cost of this is, of course, a bit of set-up at the start, but that's not a great price to pay if you have a lot of words in the dictionary.
Well that's all pretty fantastic! Let's try implementing it:
class Node:
    def __init__(self, parent, valid_subword):
        self.parent = parent
        self.valid_subword = valid_subword
        self.children = {}

    # Extend the tree with a new node
    def extend(self, transition, makes_valid_word):
        if transition in self.children:
            if makes_valid_word:
                self.children[transition].valid_subword = True
        else:
            self.children[transition] = Node(self, makes_valid_word)
        return self.children[transition]
def generateTree(allwords):
    tree = Node(None, False)
    for word in allwords:
        current_node = tree
        for i in range(len(word)):
            current_node = current_node.extend(word[i], i == len(word) - 1)
    return tree
def checkDict(word, tree):
    current_node = tree
    for letter in word:
        try:
            current_node = current_node.children[letter]
        except KeyError:
            return False
    return current_node.valid_subword
Then, later on:

for word in allWords:
    for subword in subWords(word):
        checkDict(subword, tree)
        # Code to keep track of the number of words found, like you already have
This algorithm allows you to check whether or not a word is in your dictionary in O(m) time, where m is the length of the word being checked. Notice that this stays roughly constant no matter how many words the dictionary contains. Your original algorithm was O(n) per check, where n is the number of words in the dictionary.
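To make the trie idea above concrete, here is a compact, self-contained restatement with a tiny made-up word list (simplified relative to the code above: nodes don't track their parent, and "shading" is just a boolean flag):

```python
# A condensed, runnable sketch of the trie above. The sample words
# ('cat', 'cater', 'art') are illustrative only.
class Node:
    def __init__(self):
        self.children = {}          # letter -> child Node
        self.valid_subword = False  # "shaded": a dictionary word ends here

def generateTree(allwords):
    tree = Node()
    for word in allwords:
        current_node = tree
        for letter in word:
            if letter not in current_node.children:
                current_node.children[letter] = Node()
            current_node = current_node.children[letter]
        current_node.valid_subword = True  # shade the node for the last letter
    return tree

def checkDict(word, tree):
    current_node = tree
    for letter in word:
        if letter not in current_node.children:
            return False  # nowhere to go in the tree: not a word
        current_node = current_node.children[letter]
    return current_node.valid_subword  # True only on a shaded node

tree = generateTree(['cat', 'cater', 'art'])
```

Note that checkDict('cate', tree) walks the tree successfully but lands on a non-shaded node, so it correctly reports that 'cate' is not a word.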
1) Style and organization: it makes more sense to have a single function that generates all the subwords of a word.
2) Style: the double parentheses are not needed to use set.
3) Performance (I hope): make a set from the words you're looking up, too; then you can use the built-in set intersection check.
4) Performance (almost certainly): don't do a manual loop to find the maximum element; use the built-in max. You can compare (length, elements) tuples directly; Python compares tuples element by element from first to last, as if each element were a letter in a string.
5) Performance (maybe): ensure there are no 1- or 2-letter words in the dictionary to begin with, since they just get in the way.
6) Performance (sadly true): don't break everything up into functions.
7) Style: file I/O should use a with block to ensure proper cleanup of resources, and file objects iterate over lines by default, so we can get the lines implicitly instead of having to call .readlines().
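Point 4 leans on how Python orders tuples; here is a tiny illustration (with made-up data) of letting max pick the (count, words) pair with the largest count:

```python
# Python compares tuples element by element, so max() on (count, words)
# pairs is decided by the count; the words list only matters to break ties.
candidates = [(2, ['ape', 'pea']),
              (5, ['a', 'b', 'c', 'd', 'e']),
              (3, ['cat', 'art', 'rat'])]
best = max(candidates)  # the pair whose first element is largest
```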
I end up with (not properly tested, except the 'fragments' expression):
def countedSubWords(word, dictionary):
    fragments = set(
        word[i:j]
        for i in range(len(word)) for j in range(i+3, len(word)+1)
    )
    subWords = fragments.intersection(dictionary)
    return (len(subWords), subWords)

def main():
    with open('wordlist.txt') as words:
        dictionary = set(word.strip() for word in words if len(word.strip()) > 2)
        print max(countedSubWords(word, dictionary) for word in dictionary)
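As a quick sanity check of countedSubWords, here it is restated and applied to a toy dictionary (the words are made up for illustration; the real run uses wordlist.txt):

```python
def countedSubWords(word, dictionary):
    # Same function as above: collect every substring of length >= 3,
    # then intersect with the dictionary in one set operation.
    fragments = set(
        word[i:j]
        for i in range(len(word)) for j in range(i+3, len(word)+1)
    )
    subWords = fragments.intersection(dictionary)
    return (len(subWords), subWords)

# 'carts' contains the substrings 'art' and 'cart' from this toy dictionary
dictionary = set(['cat', 'art', 'cart', 'ratio'])
count, found = countedSubWords('carts', dictionary)
```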
To explore basic Python, have a look at this function (basically a faster, polished, PEP8-happy version of what JBernardo and Karl Knechtel suggested):
def check_dict(word, dictionary):
    """Return all subwords of `word` that are in `dictionary`."""
    fragments = set(word[i:j]
                    for i in xrange(len(word) - 2)
                    for j in xrange(i + 3, len(word) + 1))
    return fragments & dictionary

dictionary = frozenset(word for word in word_list if len(word) >= 3)
print max(((word, check_dict(word, dictionary)) for word in dictionary),
          key=lambda (word, subwords): len(subwords))  # max = the most subwords
Outputs something like:
('greatgrandmothers',
set(['and', 'rand', 'great', 'her', 'mothers', 'moth', 'mother', 'others', 'grandmothers', 'grandmother', 'ran', 'other', 'greatgrandmothers', 'greatgrandmother', 'grand', 'hers', 'the', 'eat']))
for the word list from http://www.mieliestronk.com/wordlist.html.
Now I know you're not going for performance (the code above is already <1s for the standard English vocab of 58k words).
But in case you'd need to run this super fast, say in some inner loop :) the temporary sets that check_dict allocates on the heap are the main performance killer.
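One possible sketch of that optimisation (my reading of the remark above, not the answer's actual code) is to filter slices against the dictionary as they are generated, so the full set of every length-3+ slice is never materialised:

```python
# Hypothetical variant of check_dict: only matching slices are collected,
# so the large intermediate `fragments` set is never built.
def check_dict_lazy(word, dictionary):
    return set(word[i:j]
               for i in range(len(word) - 2)
               for j in range(i + 3, len(word) + 1)
               if word[i:j] in dictionary)

dictionary = frozenset(['cat', 'art', 'cart'])
```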