Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

most efficient data structure for a read-only list of strings (about 100,000) with fast prefix search

I'm writing an application that needs to read a list of strings from a file, save them in a data structure, and then look up those strings by prefixes. The list of strings is simply a list of words in a given language. For example, if the search function gets "stup" as a parameter, it should return ["stupid", "stupidity", "stupor"...]. It should do so in O(log(n)*m) time, where n is the size of the data structure and m is the number of results and should be as fast as possible. Memory consumption is not a big issue right now. I'm writing this in python, so it would be great if you could point me to a suitable data structure (preferably) implemented in c with python wrappers.

like image 581
Kim Stebel Avatar asked Jul 15 '09 12:07

Kim Stebel


People also ask

What is the fastest data structure for searching?

The best data structure for faster searching of string is TRIE. Tries are an extremely special and useful data-structure that are based on the prefix of a string. They are used to represent the “Retrieval” of data. A Trie is a special data structure used to store strings that can be visualized like a graph.

Which is the most efficient data structure?

Arrays. An array is a linear data structure that holds an ordered collection of values. It's the most efficient in storing and accessing a sequence of objects.

Which data structure is most efficient in terms of both space and time?

2 Answers. i think linked list would be better ,it will take constant space and O(n) time complexity.


2 Answers

You want a trie.

http://en.wikipedia.org/wiki/Trie

I've used them in Scrabble and Boggle programs. They're perfect for the use case you described (fast prefix lookup).

Here's some sample code for building up a trie in Python. This is from a Boggle program I whipped together a few months ago. The rest is left as an exercise to the reader. But for prefix checking you basically need a method that starts at the root node (the variable words), follows the letters of the prefix to successive child nodes, and returns True if such a path is found and False otherwise.

class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final
        self.children = {}
    def __contains__(self, letter):
        return letter in self.children
    def get(self, letter):
        return self.children[letter]
    def add(self, letters, n=-1, index=0):
        if n < 0: n = len(letters)
        if index >= n: return
        letter = letters[index]
        if letter in self.children:
            child = self.children[letter]
        else:
            child = Node(letter, index==n-1)
            self.children[letter] = child
        child.add(letters, n, index+1)

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

words = load_dictionary('dictionary.txt')
like image 149
FogleBird Avatar answered Oct 24 '22 08:10

FogleBird


Some Python implementations of tries:

  • http://jtauber.com/2005/02/trie.py
  • http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx
  • http://filoxus.blogspot.com/2007/11/trie-in-python.html
like image 4
ThibThib Avatar answered Oct 24 '22 08:10

ThibThib