really hard to understand suffix tree

Tags:

I've been searching for tutorials about suffix tree for quite a while. In SO, I found 2 posts about understanding suffix tree: 1, 2.

But I can't say that I understand how to build one, Oops. In Skiena's book Algorithm design manual, he says:

Since linear time suffix tree construction algorithms are nontrivial, I recommend using an existing implementation.

Well, is the on-line construction algorithm for suffix tree so hard? Anybody can put me in the right direction to understand it?

Anyway, cut to the chase, besides the construction, there is one more thing I don't understand about suffix tree. Because the edges in suffix tree are just a pair of integers (right?) specifying the starting and ending pos of the substring, then if I want to search a string x in this suffix tree, how should I do it? De-reference those integers in the suffix tree, then compare them one by one with x? Couldn't be this way.

871

asked Mar 05 '12 01:03

Alcott

2 Answers

Firstly, there are many ways to construct a suffix tree. There is the original O(n) method by Weiner (1973), the improved one by McCreight (1976), the most well-known by Ukkonen (1991/1992), and a number of further improvements, largely related to implementation and storage efficiency considerations. Most notable among those is perhaps the Efficient implementation of lazy suffix trees by Giegerich and Kurtz.

Moreover, since the direct construction of suffix arrays has become possible in O(n) time in 2003 (e.g. using the Skew algorithm, but there are others as well), and since there are well-studied methods for

emulating suffix trees using suffix arrays (e.g. Abouelhoda/Kurtz 2004)
compressing suffix arrays (see Navarro/Mäkinen 2007 for a survey)

suffix arrays are usually preferred over suffix trees. Therefore, if your intention is to build a highly optimised implementation for a specific purpose, you might want to look into studying suffix array construction algorithms.

However, if your interest is in suffix tree construction, and in particular the Ukkonen algorithm, I would like to suggest that you take a close look at the description in this SO post, which you mentioned already, and we try to improve that description together. It's certainly far from a perfectly intuitive explanation.

To answer the question about how to compare input string to edge labels: For efficiency reasons during construction and look-up, the initial character of every edge label is usually stored in the node. But the rest must be looked up in the main text string, just like you said, and indeed this can cause issues, in particular when the string is so large that it cannot readily be held in memory. That (plus the fact that, like any direct implementation of a tree, the suffix tree is a data structure that contains a lot of pointers, which consume much memory and make it hard to maintain locality of reference and to benefit from memory caching) is one of the main reasons why suffix trees are so much harder to handle than e.g. inverted indexes.

answered Nov 15 '22 12:11

jogojapan

If you combine the suffix array with an lcp table and a child table, which of course you should do, you essentially get a suffix tree. This point is made in the paper: Linearized Suffix Trees by Kim, Park and Kim. The lcp table enables a rather awkward bottom-up traversal, and the child table enables an easy traversal of either kind. So the story about suffix trees using pointers causing locality of reference problems is in my opinion obsolete information. The suffix tree is therefore``the right and easy way to go,'' as long as you implement the tree using an underlying suffix array.

The paper by Kim, Park and Kim describes a variant of the approach in Abouelhoda et al's misleadingly titled paper: Replacing suffix trees with enhanced suffix arrays. The Kim et al paper get it right that this is an implementation of suffix trees, and not a replacement. Moreover, the details of Abouelhoda et al's construction are more simply and intuitively described in Kim et al.

answered Nov 15 '22 14:11

Dale Gerdemann

Related questions
                            
                                How to left join fetch multiple children in Hibernate?
                            
                                Using WebKit for a desktop application [closed]
                            
                                Do multi-core CPUs share the MMU and page tables?
                            
                                To Convert Image to Cartoon in ios
                            
                                Android:Bug in build path
                            
                                Setup an SSL certificate on an EC2 instance
                            
                                How to a synchronize tasks?
                            
                                C# Entity Framework Pagination
                            
                                Why does ARC retain method arguments?
                            
                                select max() from count() [duplicate]
                            
                                Using C# to read/write Excel files (.xls/.xlsx)
                            
                                Arduino UNO basics for C# [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With