Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the current state-of-the-art suffix array construction algorithm?

Tags:

suffix-array

I'm looking for a fast suffix-array construction algorithm. I'm more interested in ease of implementation and raw speed than asymptotic complexity (I know that a suffix array can be constructed by means of a suffix tree in O(n) time, but that takes a lot of space; apparently other algorithms have bad worst-case big-O complexity, but run quite fast in practice). I don't mind algorithms that generate an LCP array as a by-product, since I need one anyway for my own purposes.

I found a taxonomy of various suffix array construction algorithms, but it's out of date. I've heard of SA-IS, qsufsort, and BPR, but I don't really know how they compare to each other, nor if there are even better algorithms. Considering how hot the suffix-array field seems to be right now, I expect that some other algorithms have superseded those since they were published. I seem to recall coming across a paper describing a fast algorithm called "split", but I can't find it now for the life of me.

So, what's the current state of the art? Ideally, I'd like a short list of the current best algorithms (with links, if possible) and a quick overview of their relative strengths and weaknesses.

like image 922
Cameron Avatar asked Oct 22 '11 05:10

Cameron


People also ask

What is suffix array construction?

Suffix array construction means simply sorting the set of all suffixes. • Using standard sorting or string sorting the time complexity is Ω(DP(T[0..n] )). • Another possibility is to first construct the suffix tree and then traverse it from left to right to collect the suffixes in lexicographical order.

How do you find the suffix of an array?

After we construct the generalized Suffix Array of the concatenation of both strings T1$T2# of length n = n1+n2 in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one that comes from two different ...

What will be the suffix array of the string engineering?

6. What will be the suffix array of the string “engineering”? Explanation: Correct choice is : 5 0 6 10 2 3 8 4 9 1 7. Because the suffix array formed will be: 5 eering 0 engineering 6 ering 10 g 2 gineering 3 ineering 8 ing 4 neering 9 ng 1 ngineering 7 ring.

What is the use of suffix array?

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full text indices, data compression algorithms, and the field of bibliometrics.


2 Answers

Currently, the best Suffix-Array constructor known is LibDivSufSort, by Yuta Mori : http://code.google.com/p/libdivsufsort/

It uses Induced Sorting methodology (Basically, after sorting all strings starting with "A*", you can induce sortings of strings "BA*" "CA*" "DA*" etc.)

It is praised everywhere for its efficiency and nice handling of degenerated cases. It's also the fastest, and uses the optimal amount of memory (5N). The license is unobstrusive, so this algorithm is integrated into several other programs, such as for example libbsc compression library, by Ilya Grebnov. http://libbsc.com/default.aspx

For comparison purposes, you will find a Suffix Array Compression benchmarks linked at this page : http://code.google.com/p/libdivsufsort/wiki/SACA_Benchmarks and this page http://homepage3.nifty.com/wpage/benchmark/index.html

The benchmark also lists many other worthy SACA implementation. Nevertheless, for both license and efficiency reason, i would recommend libdivsufsort among them.

Edit : Apparently, MSufsort is said to be heading towards version 4 soon, and is supposed to become quite faster than Divsufsort. If that is right, it would become a new SACA champion. But still, we don't have release date yet, and this will be alpha stuff. So if you need a proven implementation now, libdivsufsort remains the best choice.

Note also that these "best SACA implementations" do not use "one construction algorithm", but combine several optimisations tricks, which makes them difficult to summarize.

like image 151
Cyan Avatar answered Oct 21 '22 13:10

Cyan


https://github.com/y-256/libdivsufsort/blob/wiki/SACA_Benchmarks.md gives the list of fastest algorithms you want.

kvark's implementation is the fastest in most of case in the above benchmark. You can find the code on https://github.com/kvark/dark-archon.

libdivsufsort is a combination of Itoh-Tanaka's algorithm and the Induced Sorting's post process. A subset of suffixes is picked just like the induced sorting algorithm, but instead of recursively solve it by induced sorting, they are sorted by IT's algorithm.

libdivsufsort and kvark's implementation are both based on IT's algorithm.

Ko-Aluru's algorithm is similar to the IT's algorithm, which appears in 99. They both divide the suffixes into 2 categories: type S or type L. If the i-th suffix is smaller than the (i+1)-th suffix, it's type S; otherwise it's type L. After sorting all type S suffixes, we can deduce the order of all type L suffixes, then we're done.

The difference between KA's algorithm and IT's algorithm is that: KA use recursion to sort the substrings, while IT's algorithm appeals to Multikey Quicksort/MSD/insertion sort.

like image 44
Zhe Yang Avatar answered Oct 21 '22 12:10

Zhe Yang