Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How is it possible to build a suffix tree in linear time?

To build a suffix tree, in the worst case if all the letter of the string are different the complexity would be something like

n + (n-1) + (n-2) ... 1 = n*(n+1)/2 

which is O(n^2).

However according to http://en.wikipedia.org/wiki/Suffix_tree building a suffix tree takes O(n) time. What am I missing here?

like image 718
shreyasva Avatar asked Sep 17 '11 01:09

shreyasva


People also ask

How do you make a suffix tree?

We build a suffix tree by following each suffix and creating an edge for each character, starting with a top node. If the new suffix to be put in the tree begins with a set of characters that are already in the tree, we follow those characters until we have a different one, creating a new branch.

How does a suffix tree work?

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

How do you construct a suffix array?

A suffix array can be constructed from Suffix tree by doing a DFS traversal of the suffix tree. In fact Suffix array and suffix tree both can be constructed from each other in linear time. A simple method to construct suffix array is to make an array of all suffixes and then sort the array.


1 Answers

Your intuition behind why the algorithm should be Θ(n2) is a good one, but most suffix trees are designed in a way that eliminates the need for this time complexity. Intuitively, it would seem that you need Θ(n2) different nodes to hold all of the different suffixes, because you'd need n + (n - 1) + ... + 1 different nodes. However, suffix trees are typically designed so that there isn't a single node per character in the suffix. Instead, each edge is typically labeled with a sequence of characters that are substrings of the original string. It still may seem that you'd need Θ(n2) time to construct this tree because you'd have to copy the substrings over to these edges, but typically this is avoided by a cute trick - since all the edges are labeled with strings that are substrings of the input, the edges can instead be labeled with a start and end position, meaning that an edge spanning Θ(n) characters can be constructed in O(1) time and using O(1) space.

That said, constructing suffix trees is still really hard to do. The Θ(n) algorithms referenced in Wikipedia aren't easy. One of the first algorithms found to work in linear time is Ukkonen's Algorithm, which is commonly described in textbooks on string algorithms (such as Algorithms on Strings, Trees, and Sequences). The original paper is linked in Wikipedia. More modern approaches work by first building a suffix array and using that to then construct the suffix tree.

Hope this helps!

like image 136
templatetypedef Avatar answered Oct 03 '22 18:10

templatetypedef