Using suffix array algorithm for Burrows Wheeler transform

Tags:

I've sucessfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then inverse BWT transform and the output matches the input. Now I wanted to speed up creation of the BW index table using suffix arrays. I have found 2 relatively simple, supposedly fast O(n) algorithms for suffix array creation, DC3 and SA-IS which both come with C++/C source code. I tried using the sources (out-of-the-box compiling SA-IS source can also be found here), but failed to get proper a proper suffix array / BWT index table out. Here's what I've done:

T=input data, SA=output suffix array, n=size of T, K=alphabet size, BWT=BWT index table
I work on 8-bit bytes, but both algorithms need a unique sentinel / EOF marker in form of a zero byte (DC3 needs 3, SA-IS needs one), thus I convert all my input data to 32-bit integers, increase all symbols by 1 and append the sentinel zero bytes. This is T.
I create an integer output array SA (of size n for DC3, n+1 for KA-IS) and apply the algorithms. I get results similar to my sorting BWT transform, but some values are odd (see UPDATE 1). Also the results of both algorithms differ slightly. The SA-IS algorithm produces an excess index value at the front, so all results need to be copied left by one index (SA[i]=SA[i+1]).
To convert the suffix array to the proper BWT indices, I subtract 1 from the suffix array values, do a modulo and should have the BWT indices (according to this): BWT[i]=(SA[i]-1)%n.

This is my code to feed the SA algorithms and convert to BWT. You should be able to more or less just plug in the SA construction code from the papers:

std::vector<int32_t> SuffixArray::generate(const std::vector<uint8_t> & data)
{
    std::vector<int32_t> SA;
    if (data.size() >= 2)
    {
        //copy data over. we need to append 3 zero bytes, 
        //as the algorithm expects T[n]=T[n+1]=T[n+2]=0
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 3, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 3), [](int32_t & n){ n++; });
        SA.resize(data.size());
        SA_DC3(T.data(), SA.data(), data.size(), 256);

        OR

        //copy data over. we need to append a zero byte, 
        //as the algorithm expects T[n-1]=0 (where n is the size of its input data)
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K] 
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 1, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 1), [](int32_t & n){ n++; });
        SA.resize(data.size() + 1); //crashes if not one extra byte at the end
        SA_IS((unsigned char *)T.data(), SA.data(), data.size() + 1, 256, 4); //algorithm expects size including sentinel
        std::rotate(SA.begin(), std::next(SA.begin()), SA.end()); //rotate left by one to get same result as DC3
        SA.resize(data.size());
    }
    else
    {
        SA.push_back(0);
    }
    return SA;
}

void SuffixArray::toBWT(std::vector<int32_t> & SA)
{
    std::for_each(SA.begin(), SA.end(), [SA](int32_t & n){ n = ((n - 1) < 0) ? (n + SA.size() - 1) : (n - 1); });
}

What am I doing wrong?

UPDATE 1
When applying the algorithms to short amounts of test text data like "yabbadabbado" / "this is a test." / "abaaba" or a big text file (alice29.txt from the Canterbury corpus) they work fine. Actually the toBWT() function isn't even necessary.
When applying the algorithms to binary data from a file containing the full 8-bit byte alphabet (executable etc.), they don't seem to work correctly. Comparing the results of the algorithms to that of the regular BWT indices, I notice erroneous indices (4 in my case) at the front. The number of indices (incidently?) corresponds to the recursion depth of the algorithms. The indices point to where the original source data had the last occurrences of 0s (before I converted them to 1s when building T)...

UPDATE 2
There are more differing values when I binary compare the regular BWT array and the suffix array. This might be expected, as afair sorting must not necessarily be the same as with a standard sort, BUT the resulting data transformed by the arrays should be the same. It is not.

UPDATE 3
I tried modifying a simple input string till both algorithm "failed". After changing two bytes of the string "this is a test." to 255 or 0 (from 74686973206973206120746573742Eh to e.g. 746869732069732061FF74657374FFh, the last byte has to be changed!) the indices and transformed string are not correct anymore. It also seems to be enough to change the last character of the string to a character already ocurring in the string, e.g. "this is a tests" 746869732069732061207465737473h. Then two indices and two characters of the transformed strings will swapped (comparing regular sorting BWT and BWT that uses SAs).

I find the whole process of having to convert the data to 32-bit a bit awkward. If somebody has a better solution (paper, better yet, some source code) to generate a suffix array DIRECTLY from a string with an 256-char alphabet, I'd be happy.

302

asked Jan 06 '16 18:01

Bim

1 Answers

I have now figured this out. My solution was two-fold. Some people suggested using a library, which I did SAIS-lite by Yuta Mori.
The real solution was to duplicate and concatenate the input string and run the SA-generation on this string. When saving the output string you need to filter out all SA indices above the original data size. This is not an ideal solution, because you need to allocate twice as much memory, copy twice and do the transform on the double amount of data, but it is still 50-70% faster than std::sort. If you have a better solution, I'd love to hear it.
You can find the updated code here.

171

answered Nov 15 '22 03:11

Bim

Related questions
                            
                                Is mmap + madvise really a form of async I/O?
                            
                                How to Optimizing a VBO/IBO to maximize GPU cache usage
                            
                                Sort map with (Spanish) accented words in Rcpp
                            
                                error: jump to label 'foo' crosses initialization of 'bar'
                            
                                Installing visual studio 6 on windows 10
                            
                                Can I share a external texture between 2 OpenGL contexts, Android
                            
                                Creating a class in a DLL using std::string. C4251 warning
                            
                                Initializing An Array of Unknown Dimensionality
                            
                                When do these load DLLs : Implicit Linking VS Explicit Linking
                            
                                Boost type_erasure any with const member function
                            
                                How to set scons to output full expanded command line?
                            
                                How to force cmake link against custom gcc libraries
                            
                                Why are all my C++ programs exiting with 0xc0000139?
                            
                                How should I define a set number of bits in a typedef?
                            
                                Find out Current Directory during Visual Studio C++ Debug session
                            
                                Using inheritance within a union
                            
                                Is it legal for a standard library implementation to specialize a function templated on a concept with a child concept?
                            
                                Reference or pointer to std::vector of incomplete type
                            
                                Why gcc emits unneeded memory accesses when passing trivial struct by value?
                            
                                (involving explicit)Precedence with operator and constructor conversion

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Using suffix array algorithm for Burrows Wheeler transform

Tags:

c++

algorithm

suffix-array

burrows-wheeler-transform

Bim

People also ask

1 Answers

Bim

Recent Activity

Donate For Us