most efficient data structure for a read-only list of strings (about 100,000) with fast prefix search

I'm writing an application that needs to read a list of strings from a file, store them in a data structure, and then look up those strings by prefix. The list is simply a list of words in a given language. For example, if the search function gets "stup" as a parameter, it should return ["stupid", "stupidity", "stupor", ...]. It should do so in O(log(n) * m) time, where n is the size of the data structure and m is the number of results, and it should be as fast as possible. Memory consumption is not a big issue right now. I'm writing this in Python, so it would be great if you could point me to a suitable data structure, preferably one implemented in C with Python wrappers.

asked Jul 15 '09 12:07

Kim Stebel

2 Answers

You want a trie.

http://en.wikipedia.org/wiki/Trie

I've used them in Scrabble and Boggle programs. They're perfect for the use case you described (fast prefix lookup).

Here's some sample code for building up a trie in Python. This is from a Boggle program I whipped together a few months ago. The rest is left as an exercise to the reader. But for prefix checking you basically need a method that starts at the root node (the variable words), follows the letters of the prefix to successive child nodes, and returns True if such a path is found and False otherwise.

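class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter    # letter on the edge leading to this node
        self.final = final      # True if a word ends at this node
        self.children = {}      # maps letter -> child Node
    def __contains__(self, letter):
        return letter in self.children
    def get(self, letter):
        return self.children[letter]
    def add(self, letters, n=-1, index=0):
        # Recursively insert letters[index:] below this node.
        if n < 0: n = len(letters)
        if index >= n: return
        letter = letters[index]
        if letter in self.children:
            child = self.children[letter]
        else:
            child = Node(letter)
            self.children[letter] = child
        # Mark the end of a word even when the node already exists
        # (e.g. adding "stupid" after "stupidity").
        if index == n - 1:
            child.final = True
        child.add(letters, n, index+1)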

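def load_dictionary(path):
    # Build a trie from a file containing one word per line.
    result = Node()
    with open(path, 'r') as f:  # make sure the file gets closed
        for line in f:
            word = line.strip().lower()
            result.add(word)
    return result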

words = load_dictionary('dictionary.txt')
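
The prefix check described above, plus collecting the matching words, might look roughly like the sketch below. It is not part of the original Boggle program: the helper names find_node, has_prefix and words_with_prefix are made up for illustration, and the Node class here is a trimmed stand-in that only keeps the children and final attributes so the snippet runs on its own.

```python
class Node(object):
    """Trimmed stand-in for the trie node (hypothetical, illustration only)."""
    def __init__(self):
        self.final = False   # True if a word ends at this node
        self.children = {}   # maps letter -> child Node

    def add(self, word):
        node = self
        for letter in word:
            node = node.children.setdefault(letter, Node())
        node.final = True


def find_node(root, prefix):
    """Follow the prefix letter by letter; return the node reached, or None."""
    node = root
    for letter in prefix:
        node = node.children.get(letter)
        if node is None:
            return None
    return node


def has_prefix(root, prefix):
    """True if at least one stored word starts with prefix."""
    return find_node(root, prefix) is not None


def words_with_prefix(root, prefix):
    """Return all stored words that start with prefix, in sorted order."""
    start = find_node(root, prefix)
    if start is None:
        return []
    results = []
    stack = [(start, prefix)]
    while stack:
        node, word = stack.pop()
        if node.final:
            results.append(word)
        # Push children in reverse order so the smallest letter is popped first.
        for letter in sorted(node.children, reverse=True):
            stack.append((node.children[letter], word + letter))
    return results


if __name__ == '__main__':
    root = Node()
    for w in ['stupor', 'stupidity', 'stupid', 'stop', 'study']:
        root.add(w)
    print(words_with_prefix(root, 'stup'))
```

Reaching the prefix node takes one dictionary lookup per letter of the prefix, regardless of how many words are stored. After that, the cost is the size of the subtree below that node, which grows with the number and length of the matching words.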

answered Oct 24 '22 08:10

FogleBird

Some Python implementations of tries:

http://jtauber.com/2005/02/trie.py
http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx
http://filoxus.blogspot.com/2007/11/trie-in-python.html

answered Oct 24 '22 08:10

ThibThib

Tags: python, dictionary, lookup, data-structures
