Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using suffix array algorithm for Burrows Wheeler transform

I've sucessfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then inverse BWT transform and the output matches the input. Now I wanted to speed up creation of the BW index table using suffix arrays. I have found 2 relatively simple, supposedly fast O(n) algorithms for suffix array creation, DC3 and SA-IS which both come with C++/C source code. I tried using the sources (out-of-the-box compiling SA-IS source can also be found here), but failed to get proper a proper suffix array / BWT index table out. Here's what I've done:

  1. T=input data, SA=output suffix array, n=size of T, K=alphabet size, BWT=BWT index table

  2. I work on 8-bit bytes, but both algorithms need a unique sentinel / EOF marker in form of a zero byte (DC3 needs 3, SA-IS needs one), thus I convert all my input data to 32-bit integers, increase all symbols by 1 and append the sentinel zero bytes. This is T.

  3. I create an integer output array SA (of size n for DC3, n+1 for KA-IS) and apply the algorithms. I get results similar to my sorting BWT transform, but some values are odd (see UPDATE 1). Also the results of both algorithms differ slightly. The SA-IS algorithm produces an excess index value at the front, so all results need to be copied left by one index (SA[i]=SA[i+1]).

  4. To convert the suffix array to the proper BWT indices, I subtract 1 from the suffix array values, do a modulo and should have the BWT indices (according to this): BWT[i]=(SA[i]-1)%n.

This is my code to feed the SA algorithms and convert to BWT. You should be able to more or less just plug in the SA construction code from the papers:

std::vector<int32_t> SuffixArray::generate(const std::vector<uint8_t> & data)
{
    std::vector<int32_t> SA;
    if (data.size() >= 2)
    {
        //copy data over. we need to append 3 zero bytes, 
        //as the algorithm expects T[n]=T[n+1]=T[n+2]=0
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 3, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 3), [](int32_t & n){ n++; });
        SA.resize(data.size());
        SA_DC3(T.data(), SA.data(), data.size(), 256);

        OR

        //copy data over. we need to append a zero byte, 
        //as the algorithm expects T[n-1]=0 (where n is the size of its input data)
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K] 
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 1, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 1), [](int32_t & n){ n++; });
        SA.resize(data.size() + 1); //crashes if not one extra byte at the end
        SA_IS((unsigned char *)T.data(), SA.data(), data.size() + 1, 256, 4); //algorithm expects size including sentinel
        std::rotate(SA.begin(), std::next(SA.begin()), SA.end()); //rotate left by one to get same result as DC3
        SA.resize(data.size());
    }
    else
    {
        SA.push_back(0);
    }
    return SA;
}

void SuffixArray::toBWT(std::vector<int32_t> & SA)
{
    std::for_each(SA.begin(), SA.end(), [SA](int32_t & n){ n = ((n - 1) < 0) ? (n + SA.size() - 1) : (n - 1); });
}

What am I doing wrong?

UPDATE 1
When applying the algorithms to short amounts of test text data like "yabbadabbado" / "this is a test." / "abaaba" or a big text file (alice29.txt from the Canterbury corpus) they work fine. Actually the toBWT() function isn't even necessary.
When applying the algorithms to binary data from a file containing the full 8-bit byte alphabet (executable etc.), they don't seem to work correctly. Comparing the results of the algorithms to that of the regular BWT indices, I notice erroneous indices (4 in my case) at the front. The number of indices (incidently?) corresponds to the recursion depth of the algorithms. The indices point to where the original source data had the last occurrences of 0s (before I converted them to 1s when building T)...

UPDATE 2
There are more differing values when I binary compare the regular BWT array and the suffix array. This might be expected, as afair sorting must not necessarily be the same as with a standard sort, BUT the resulting data transformed by the arrays should be the same. It is not.

UPDATE 3
I tried modifying a simple input string till both algorithm "failed". After changing two bytes of the string "this is a test." to 255 or 0 (from 74686973206973206120746573742Eh to e.g. 746869732069732061FF74657374FFh, the last byte has to be changed!) the indices and transformed string are not correct anymore. It also seems to be enough to change the last character of the string to a character already ocurring in the string, e.g. "this is a tests" 746869732069732061207465737473h. Then two indices and two characters of the transformed strings will swapped (comparing regular sorting BWT and BWT that uses SAs).

I find the whole process of having to convert the data to 32-bit a bit awkward. If somebody has a better solution (paper, better yet, some source code) to generate a suffix array DIRECTLY from a string with an 256-char alphabet, I'd be happy.

like image 302
Bim Avatar asked Jan 06 '16 18:01

Bim


People also ask

What is the Burrows-Wheeler transform of the string?

The Burrows-Wheeler transform (BWT) is an algorithm that takes blocks of data, such as strings, and rearranges them into runs of similar characters. After the transformation, the output block contains the same exact data elements before it had started, but differs in the ordering.

Why is Burrows-Wheeler transform useful for compression?

This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding.

What is a suffix array in data structure?

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full text indices, data compression algorithms, and the field of bibliometrics.


1 Answers

I have now figured this out. My solution was two-fold. Some people suggested using a library, which I did SAIS-lite by Yuta Mori.
The real solution was to duplicate and concatenate the input string and run the SA-generation on this string. When saving the output string you need to filter out all SA indices above the original data size. This is not an ideal solution, because you need to allocate twice as much memory, copy twice and do the transform on the double amount of data, but it is still 50-70% faster than std::sort. If you have a better solution, I'd love to hear it.
You can find the updated code here.

like image 171
Bim Avatar answered Nov 15 '22 03:11

Bim