String Algorithm Question - Word Beginnings

I have a problem, and I'm not too sure how to solve it without going down the route of inefficiency. Say I have a list of words:

  • Apple
  • Ape
  • Arc
  • Abraid
  • Bridge
  • Braide
  • Bray
  • Boolean

What I want to do is process this list and get what each word starts with up to a certain depth, e.g.

  • a - Apple, Ape, Arc, Abraid
  • ab - Abraid
  • ar - Arc
  • ap - Apple, Ape
  • b - Bridge, Braide, Bray, Boolean
  • br - Bridge, Braide, Bray
  • bo - Boolean

Any ideas?

Asked by Matthew H


2 Answers

You can use a Trie structure.

       (root)
         / 
        a - b - r - a - i - d
       / \   \
      p   r   e
     / \   \
    p   e   c
   /
  l
 /
e

Just find the node that you want and get all its descendants, e.g., if I want ap-:

       (root)
         / 
        a - b - r - a - i - d
       / \   \
     [p]  r   e
     / \   \
    p   e   c
   /
  l
 /
e
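
For example, here is a minimal dict-of-dicts trie in Python. This is an illustrative sketch, not a standard library API; the Trie class, the '$' end-of-word marker, and the lowercasing of the input are all choices of my own:

    class Trie:
        '''minimal dict-of-dicts trie; '$' marks the end of a complete word'''
        def __init__(self, words=()):
            self.root = {}
            for word in words:
                self.insert(word)

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node['$'] = word  # store the full word at its terminal node

        def words_with_prefix(self, prefix):
            '''walk down to the prefix node, then collect every word below it'''
            node = self.root
            for ch in prefix:
                if ch not in node:
                    return []
                node = node[ch]
            stack, found = [node], []
            while stack:  # depth-first traversal of the prefix's subtree
                current = stack.pop()
                for key, child in current.items():
                    if key == '$':
                        found.append(child)  # child is the stored word itself
                    else:
                        stack.append(child)
            return found

    words = ['Apple', 'Ape', 'Arc', 'Abraid', 'Bridge', 'Braide', 'Bray', 'Boolean']
    trie = Trie(w.lower() for w in words)
    print(trie.words_with_prefix('ap'))  # ['ape', 'apple'] (order may vary)

Each lookup costs O(len(prefix)) to find the node, plus the size of that node's subtree to collect the matches, rather than a scan of the entire word list.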
Answered by quantumSoup


Perhaps you're looking for something like:

    #!/usr/bin/env python
    def match_prefix(pfx, seq):
        '''return the subset of seq that starts with pfx'''
        results = list()
        for i in seq:
            if i.startswith(pfx):
                results.append(i)
        return results

    def extract_prefixes(lngth, seq):
        '''return all prefixes in seq of the length specified'''
        results = dict()
        lngth += 1  # convert the 0-based loop index into a prefix length
        for i in seq:
            if i[0:lngth] not in results:
                results[i[0:lngth]] = True
        return sorted(results.keys())

    def gen_prefix_indexed_list(depth, seq):
        '''return a dictionary of all words matching each prefix
           up to depth, keyed on these prefixes'''
        results = dict()
        for each in range(depth):
            for prefix in extract_prefixes(each, seq):
                results[prefix] = match_prefix(prefix, seq)
        return results


    if __name__ == '__main__':
        words = '''Apple Ape Arc Abraid Bridge Braide Bray Boolean'''.split()
        test = gen_prefix_indexed_list(2, words)
        for each in sorted(test.keys()):
            print("%s:\t\t%s" % (each, ' '.join(test[each])))
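
Running this under Python 3 prints something along these lines (the keys keep their original capitalization, since the code never lowercases the words):

    A:		Apple Ape Arc Abraid
    Ab:		Abraid
    Ap:		Apple Ape
    Ar:		Arc
    B:		Bridge Braide Bray Boolean
    Bo:		Boolean
    Br:		Bridge Braide Bray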

That is, you want to generate all the prefixes present in a list of words, from length one up to some number you specify (2 in this example), and then produce an index of all words matching each of these prefixes.

I'm sure there are more elegant ways to do this, but for a quick and easily explained approach I've just built it from a simple bottom-up functional decomposition of the apparent spec. If the end result's values are lists of words each matching a given prefix, then we start with a function to filter such matches out of our input. If the end result's keys are all prefixes of length 1 through N that appear in our input, then we need a function to extract those. The spec then reduces to an extremely straightforward nested loop around these two functions.

Of course this nested loop might be a problem. Such things usually equate to O(n^2) efficiency. As shown, this will iterate over the original list C * N * N times (C is the constant number of prefix lengths considered, i.e. the depth; N is the length of the list).

If this decomposition provides the desired semantics then we can look at improving the efficiency. The obvious approach would be to build the dictionary lazily as we iterate once over the list: for each word, for each prefix length, generate the key, append this word to the list/value stored at that key, and continue to the next word.
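
A sketch of that single-pass version (gen_prefix_index is my own name for it, not part of the code above):

    def gen_prefix_index(depth, seq):
        '''one pass over seq; the inner loop is only over prefix lengths'''
        results = {}
        for word in seq:
            for length in range(1, depth + 1):
                if len(word) >= length:  # skip prefixes longer than the word
                    results.setdefault(word[:length], []).append(word)
        return results

Each word is touched once, so the total work is proportional to N * depth rather than C * N * N.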

There's still a nested loop, but it's the short loop over prefix lengths. This alternative design also has the advantage of letting us iterate over words from any iterable, not just an in-memory list. So we could iterate over lines of a file, results generated from a database query, etc., without incurring the memory overhead of keeping the entire original word list in memory.

Of course we're still storing the dictionary in memory. However, we can change that too, decoupling the logic from the input and storage. When we append each input word to the various prefix/key values, we don't care whether they're lists in a dictionary, lines in a set of files, or values being pulled out of (and pushed back into) a DBM or other key/value store (for example CouchDB or some other "NoSQL" clustered database).

The implementation of that is left as an exercise to the reader.

Answered by Jim Dennis