Trie (Prefix Tree) in Python

Tags:

I don't know if this is the place to ask about algorithms. But let's see if I get any answers ... :)

If anything is unclear I'm very happy to clarify things.

I just implemented a Trie in python. However, one bit seemed to be more complicated than it ought to (as someone who loves simplicity). Perhaps someone has had a similar problem?

My aim was to minimize the number of nodes by storing the largest common prefix of a sub-trie in its root. For example, if we had the words stackoverflow, stackbase and stackbased, then the tree would look something like this:

              [s]tack
[o]verflow ______/ \_______ [b]ase
                                  \___ [d]

Note that one can still think of the edges having one character (the first one of the child node).

Find-query is simple to implement. Insertion is not hard, but somewhat more complex than I want.. :(

My idea was to insert the keys one after the other (starting from an empty trie), by first searching for the to-be-inserted key k (Find(k)), and then rearranging/splitting the nodes locally at the place where the find-procedure stops. There turn out to be 4 cases: (Let k be the key we want to insert, and k' be the key of the node, where the search ended)

k is identical to k'
k is a "proper" prefix of k'
k' is a "proper" prefix of k
k and k' share some common prefix, but none of the cases (1), (2) or (3) occur.

It seems that each of the cases are unique and thus imply different modifications of the Trie. BUT: is it really that complex? Am I missing something? Is there a better approach?

Thanks :)

465

asked Jun 07 '09 01:06

jacob

4 Answers

At a glance, it sounds like you've implemented a Patricia Trie. This approach also is called path compression in some of the literature. There should be copies of that paper that aren't behind the ACM paywall, which will include an insertion algorithm.

There's also another compression method you may want to look at: level compression. The idea behind path compression is to replace strings of single child nodes with a single super node that has a "skip" count. The idea behind level compression is to replace full or nearly full subtrees with a super node with a "degree" count that says how many digits of the key the node decodes. There's also a 3rd approach called width compression, but I'm afraid my memory fails me and I couldn't find a description of it with quick googling.

Level compression can shorten the average path considerably, but insertion and removal algorithms get quite complicated as they need to manage the trie nodes as similarly to dynamic arrays. For the right data sets, level compressed trees can be fast. From what I remember, they're the 2nd fastest approach for storing IP routing tables, the fastest is some sort of hash trie.

198

answered Oct 05 '22 10:10

Jason Watkins

I don't see anything wrong with your approach. If you're looking for a spike solution, perhaps the action taken in case 4 is actually feasible for the first three cases, IE find the common prefix to k and k' and rebuild the node with that in mind. If it happens that the keys were prefixes of one-another, the resulting trie will still be correct, only the implementation did a bit more work than it really had to. but then again, without any code to look at it's hard to say if this works in your case.

answered Oct 05 '22 09:10

SingleNegationElimination

Somewhat of a tangent, but if you are super worried about the number of nodes in your Trie, you may look at joining your word suffixes too. I'd take a look at the DAWG (Directed Acyclic Word Graph) idea: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph

The downside of these is that they aren't very dynamic and creating them can be difficult. But, if your dictionary is static, they can be super compact.

answered Oct 05 '22 09:10

Joe Beda

I have a question regarding your implementation. What is the level of granularity that you decide to split your strings on to make the prefix tree. You could split stack as either s,t,a,c,k or st,ta,ac,ck and many other ngrams of it. Most prefix tree implementations take into account an alphabet for the language, based on this alphabet, you do the splitting.

If you were building a prefix tree implementation for python then your alphabets would be things like def, : , if , else... etc

Choosing the right alphabet makes a huge difference in building efficient prefix trees. As for your answers, you could look for PERL packages on CPAN which do longest common substring computation using trie's. You may have some luck there as most of their implementation is pretty robust.

answered Oct 05 '22 09:10

Ritesh M Nayak

Related questions
                            
                                pytest capsys: checking output AND getting it reported?
                            
                                Problems in implementing Horner's method in Python
                            
                                Splines with Python (using control knots and endpoints)
                            
                                Python: TypeError: Pickling an AuthenticationString object is disallowed for security reasons
                            
                                An array field in scrapy.Item
                            
                                how can I flatten an 2d numpy array, which has different length in the second axis?
                            
                                how to deactivate virtualenv from a bash script
                            
                                Overriding dict.update() method in subclass to prevent overwriting dict keys
                            
                                Asyncio + aiohttp - redis Pub/Sub and websocket read/write in single handler
                            
                                Asyncio RuntimeError: Event Loop is Closed
                            
                                LINK : fatal error LNK1104: cannot open file 'python27.lib'
                            
                                Fatal error C1083: Cannot open include file: 'openssl/opensslv.h'
                            
                                Python Pandas read_excel dtype str replace nan by blank ('') when reading or when writing via to_csv
                            
                                What is the difference between PyCUDA and NumbaPro CUDA Python?
                            
                                Python numpy.corrcoef() RuntimeWarning: invalid value encountered in true_divide c /= stddev[:, None]
                            
                                How to penalize False Negatives more than False Positives
                            
                                Chaining tests and passing an object from one test to another
                            
                                AttributeError: module 'numpy' has no attribute '__version__'
                            
                                Makefile can't use `conda activate`
                            
                                Cannot run apache airflow after fresh install, python import error

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Trie (Prefix Tree) in Python

Tags:

python

algorithm

trie

jacob

People also ask

4 Answers

Jason Watkins

SingleNegationElimination

Joe Beda

Ritesh M Nayak

Recent Activity

Donate For Us