Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory Efficient data structure for Trie Implementation

I am implementing a Trie in python. Till now I have come across two different methods to implement it:

1) use a class Node (similar to struct Node in C++) with data members:

char - to store character
is_end - to store end of word (true or false)
prefix_count - store number of words with current prefix

child - Node type dict (to store other nodes i.e. for 26 alphabets)

class Node(object):
    def __init__(self):
        self.char = ''
        self.word = ''
        self.is_end = False
        self.prefix_count = 0
        self.child = {}

2) use a dictionary to store all the data:

e.g. for the input words = {'foo', 'bar', 'baz', 'barz'}

we derive this dictionary:

         {'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
          'z': {'_end_': '_end_'}}},
          'f': {'o': {'o': {'_end_': '_end_'}}}}

Which is the efficient and standard data structure, which is both memory efficient and fast for traversal and other trie operations on big dataset of words?

like image 687
divyum Avatar asked Apr 19 '15 18:04

divyum


2 Answers

Why not both? Just yesterday I was implementing a similar data structure for storing and retrieving a hierarchy of objects and contemplated this exact situation. Ended up using a Node object with a children dictionary. The main node being an object allows you to have custom methods for printing it or getting stuff, and you can even have a lazy initialization if needed (you mentioned big dataset right?)

class Node(object):
    def __init__(self):
        self.children = collections.defaultdict(lambda:self.__class__())
        self.stuff = ...
    def __str__(self,depth=0):
        if self.children:
            return str(self.stuff)+'\n'+'\n'.join([' '*depth+k+' ' + v.__str__(depth+1) for k,v in self.children.items()])
        else:
            return str(self.stuff)
    def get_path(self,path):
        node = self
        while path:
            node = node.children[path.pop(0)]
        ...
like image 63
Josep Valls Avatar answered Sep 28 '22 16:09

Josep Valls


A direct replacement would be nested a list;

However [arguably] more Pythonic, more compact in memory and thus faster for lookup would be a nested tuple.

Of course, updating such trie becomes logN, as you'll have to recreate every ancestor node. Still, if your lookups are a lot more frequent than updates, it may be worth it.

like image 26
Dima Tisnek Avatar answered Sep 28 '22 15:09

Dima Tisnek