Python - find all subwords that can be found inside of a word

Ultimately, I want to find out which word in the English dictionary contains the most subwords of at least three letters. I wrote this algorithm, but it's too slow to be useful; I'm wondering how I can optimize it.

def subWords(word):
    return set((word[0:i] for i in range(2, len(word)+1))) #returns all prefixes of word with length 2 or greater

def checkDict(wordList, dictList):
    return set((word for word in wordList if word in dictList))

def main():
    dictList = [i.strip() for i in open('wordlist.txt').readlines()]
    allwords = list()
    maximum = (0, list())

    for dictWords in dictList:
        for i in range (len(dictWords)):
            for a in checkDict(subWords(dictWords[i: len(dictWords) + 1]), dictList):
                allwords.append(a)

        if len(allwords) > maximum[0]:
            maximum = (len(allwords), allwords)

        print maximum
        allwords = list()

    print maximum 
main()
asked Aug 03 '11 by Parseltongue

3 Answers

The major weakness of your algorithm is that it compares every subword against every word in the dictionary. You don't need to do that: if your subword begins with 'a', there's no point matching it against words that begin with 'b'; if its next letter is 'c', there's no point comparing it against words that continue with 'd'. The question then becomes: how do we implement this idea efficiently?

To do this, we can build a tree (a trie) representing all of the words in the dictionary. We construct it by inserting each dictionary word letter by letter, extending the tree as needed, and shading the node where each word ends.

[figure: an example trie, with the nodes where dictionary words end shaded]

When we want to test whether a subword is in this tree, we walk through the subword letter by letter, starting at the top and using each letter to decide where to go next. If at some point we have nowhere to go, or if we finish the subword on a non-shaded node, then it's not a word; if we finish on a shaded node, it is. The effect is that we search the entire dictionary at once rather than one word at a time. The cost is, of course, a bit of set-up at the start, but that's not a great price to pay if you have a lot of words in the dictionary.

Well that's all pretty fantastic! Let's try implementing it:

class Node:
    def __init__( self, parent, valid_subword ):
        self.parent = parent
        self.valid_subword = valid_subword #True if the path to this node spells a dictionary word
        self.children = {}

    #Extend the tree with a new node, or mark an existing node as a word ending
    def extend( self, transition, makes_valid_word ):
        if transition in self.children:
            if makes_valid_word:
                self.children[transition].valid_subword = True
        else:
            self.children[transition] = Node( self, makes_valid_word )
        return self.children[transition]

def generateTree( allwords ):
    tree = Node( None, False )
    for word in allwords:
        current_node = tree
        for i in range(len(word)):
            #the last letter of each word marks a valid word ending
            current_node = current_node.extend( word[i], i == len(word) - 1 )
    return tree

def checkDict( word, tree ):
    current_node = tree
    for letter in word:
        try:
            current_node = current_node.children[letter]
        except KeyError:
            return False

    return current_node.valid_subword

Then, later on:

tree = generateTree(dictList)
for word in allWords:
    for subword in subWords(word):
        if checkDict(subword, tree):
            pass #Code to keep track of the number of words found, like you already have

This algorithm lets you check whether or not a word is in your dictionary in O(m) time, where m is the length of the subword being checked. Notice that this stays roughly constant no matter how many words the dictionary contains. Your original algorithm was O(n) per check, where n is the number of words in the dictionary.
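As a quick sanity check of the lookup behaviour, here is a minimal sketch using a toy dictionary (any real word list works the same way):

tree = generateTree(['cat', 'cats'])
print checkDict('cat', tree)  #True: the walk ends on a shaded node
print checkDict('ca', tree)   #False: the node exists but is not shaded
print checkDict('dog', tree)  #False: the walk falls off the tree at 'd'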

answered by Slubb


1) Style and organization: it makes more sense to have a single function that generates all the subwords of a word.

2) Style: the double parentheses are redundant; set() can take the generator expression directly.

3) Performance (I hope): make a set from the words you're looking up, too; then you can use the built-in set intersection check.

4) Performance (almost certainly): don't write a manual loop to find the maximum element; use the built-in max. You can compare (length, elements) tuples directly; Python compares tuples element by element from first to last, the same way letters in a string are compared (see the small example after this list).

5) Performance (maybe): Ensure there are no 1- or 2-letter words in the dictionary to begin with, since they just get in the way.

6) Performance (sadly true): don't break everything up into tiny functions; Python's function-call overhead adds up.

7) Style: file I/O should use a with block to ensure the file is properly closed, and file objects iterate over their lines by default, so there's no need to call .readlines().
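To illustrate point 4, a small interactive sketch of how Python orders (length, elements) tuples; the sets here are made up for the example:

>>> candidates = [(2, set(['ab', 'cd'])), (5, set(['abcde'])), (3, set(['xyz']))]
>>> max(candidates) #the length field is compared first, so no key= is needed
(5, set(['abcde']))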

I end up with (not properly tested, except the 'fragments' expression):

def countedSubWords(word, dictionary): 
  fragments = set(
    word[i:j]
    for i in range(len(word)) for j in range(i+3, len(word)+1)
  )
  subWords = fragments.intersection(dictionary)
  return (len(subWords), subWords)


def main():
  with open('wordlist.txt') as words:
    dictionary = set(word.strip() for word in words if len(word.strip()) > 2)
    print max(countedSubWords(word, dictionary) for word in dictionary)

main()
answered by Karl Knechtel


To explore basic Python, have a look at this function (basically a faster, polished, PEP8-happy version of what JBernardo and Karl Knechtel suggested):

def check_dict(word, dictionary): 
  """Return all subwords of `word` that are in `dictionary`."""
  fragments = set(word[i:j] 
                  for i in xrange(len(word) - 2) 
                  for j in xrange(i + 3, len(word) + 1))
  return fragments & dictionary

with open('wordlist.txt') as f: #one word per line, as in the question
  word_list = [line.strip() for line in f]

dictionary = frozenset(word for word in word_list if len(word) >= 3)
print max(((word, check_dict(word, dictionary)) for word in dictionary), 
          key=lambda (word, subwords): len(subwords)) # max = the most subwords

Outputs something like:

('greatgrandmothers',
set(['and', 'rand', 'great', 'her', 'mothers', 'moth', 'mother', 'others', 'grandmothers', 'grandmother', 'ran', 'other', 'greatgrandmothers', 'greatgrandmother', 'grand', 'hers', 'the', 'eat']))

for the word list from http://www.mieliestronk.com/wordlist.html.


Now I know you're not going for performance (the code above already runs in under a second on the standard English vocabulary of 58k words).

But in case you'd need to run this super fast, say in some inner loop :)

  • You'd want to avoid creating copies of all the substrings inside check_dict on the heap; that's the main performance killer.
  • You could do that with pointer arithmetic, representing each substring by a pair of pointer delimiters only (as opposed to a full string object).
  • Use that substring to quickly determine whether it is part of the valid vocabulary:
    • use the trie data structure, or its memory-friendly version, the PATRICIA tree
    • build the static trie once from your dictionary, then do fast substring lookups
    • incrementally shuffle the pointers to explore all possible substrings, increasing a hit counter for valid words (see the sketch after this list)
    • this way you avoid any dynamic allocations (no strings, no sets), blazing fast!!
  • All this is not very relevant in Python, because such memory management is too low-level and you'd be better off not using Python for performance-critical code anyway.
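Even though, as the last bullet says, Python won't see the full benefit, the trie-walking idea itself translates directly. Here is a rough sketch, reusing the Node class and generateTree function from the first answer; note that it counts every occurrence of a valid subword rather than distinct subwords:

def count_subwords(word, tree):
    #Walk the trie from each start index, indexing single characters
    #so that no substring objects are ever created
    hits = 0
    for start in xrange(len(word) - 2):
        node = tree
        for end in xrange(start, len(word)):
            node = node.children.get(word[end])
            if node is None:
                break #no dictionary word continues this prefix
            if end - start >= 2 and node.valid_subword:
                hits += 1 #word[start:end+1] is a valid subword
    return hits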
answered by Radim