Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Burrows-Wheeler Transform without EOF character

I need to perform a well-known Burrows-Wheeler Transform in linear time. I found a solution with suffix sorting and EOF character, but appending EOF changes the transformation. For example: consider the string bcababa and two rotations

  • s1 = abababc
  • s2 = ababcab

it's clear that s1 < s2. Now with an EOF character:

  • s1 = ababa#bc
  • s2 = aba#bcab

and now s2 < s1. And the resulting transformation will be different. How can I perform BWT without EOF?

like image 422
user8078 Avatar asked May 10 '12 05:05

user8078


People also ask

Why does the Burrows-Wheeler transform work?

The Burrows-Wheeler transform (BWT) is an algorithm that takes blocks of data, such as strings, and rearranges them into runs of similar characters. After the transformation, the output block contains the same exact data elements before it had started, but differs in the ordering.

Why is Burrows-Wheeler transform useful for compression?

This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding.

What is Burrows-Wheeler aligner?

March 19, 2017. BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.


2 Answers

You can perform the transform in linear time and space without the EOF character by computing the suffix array of the string concatenated with itself. Then iterate over the suffix array. If the current suffix array value is less than n, add to your output array the last character of the rotation starting at the position denoted by the current value in the suffix array. This approach will produce a slightly different BWT transform result, however, since the string rotations aren't sorted as if the EOF character were present.

A more thorough description can be found here: http://www.quora.com/Algorithms/How-I-can-optimize-burrows-wheeler-transform-and-inverse-transform-to-work-in-O-n-time-O-n-space

like image 122
John Kurlak Avatar answered Sep 30 '22 14:09

John Kurlak


You need to have EOF character in the string for BWT to work, because otherwise you can't perform the inverse transform to get the original string back. Without EOF, both strings "ba" and "ab" have the same transformed version ("ba"). With EOF, the transforms are different

ab        ba

a b |     a | b
b | a     b a |
| a b     | b a

i.e. ab transforms to "|ab" and ba to "b|a".

EOF is needed for BWT because it marks the point where the character cycle starts.

Re: doing it without the EOF character, according to Wikipedia,

Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an 'EOF' marker to the input or, augmenting the output with information, such as an index, that makes it possible to identify the input string from the class of all of its rotations.

There is a bijective version of the transform, by which the transformed string uniquely identifies the original. In this version, every string has a unique inverse of the same length.

The bijective transform is computed by first factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists by the Chen–Fox–Lyndon theorem, and can be found in linear time. Then, the algorithm sorts together all the rotations of all of these words; as in the usual Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each of these strings in this sorted list.

like image 36
Antti Huima Avatar answered Sep 30 '22 13:09

Antti Huima