Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String similarity: how exactly does Bitap work?

Tags:

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.

I understand the basic premise of the algorithm, which is (correct me if i'm wrong):

Two strings:     PATTERN (the desired string)
                 TEXT (the String to be perused for the presence of PATTERN)

Two indices:     i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
                 j (arbitrary index in TEXT)

Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1

In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].

What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:

Here is the bit-shifting version of the concept:

enter image description here

Here is T[text] for a sample search string:

enter image description here

And here is a trace of the algorithm.

enter image description here

Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.

I'd be grateful if anyone can help me understand what exactly is going on

like image 381
Kevin Avatar asked Jul 03 '12 19:07

Kevin


People also ask

How does Bitap algorithm work?

The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal.

Why is string matching algorithm important?

String matching algorithms have greatly influenced computer science and play an essential role in various real-world problems. It helps in performing time-efficient tasks in multiple domains. These algorithms are useful in the case of searching a string within another string.


2 Answers

T is slightly confusing because you would normally number positions in the pattern from left to right:

0 1 2 3 4
a b a b c

...whereas bits are normally numbered from right to left.

But writing the pattern backwards above the bits makes it clear:

  bit: 4 3 2 1 0

       c b a b a
T[a] = 1 1 0 1 0

       c b a b a
T[b] = 1 0 1 0 1

       c b a b a
T[c] = 0 1 1 1 1

       c b a b a
T[d] = 1 1 1 1 1

Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.

Equivalently, you can think of this as saying that if the current character in the input string is x, and you see a 0 in position n of T[x], then you can only possibly be matching the pattern if the match started n characters previously.


Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.

  [start]
1 1 1 1 1

As we consume characters trying to match, the state is shifted left (which shifts a zero in to the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:

        a
1 1 1 1 0

The 0 bit that was shifted in is preserved, because a current character of a can begin a match of the pattern. For any other character, the bit would be have been set to 1.

The fact that bit 0 of the state is now 0 means that we started matching the pattern on the current character; continuing, we get:

      a b
1 1 1 0 1

...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling us that a seeing a b in the current position is good if we started matching 1 character ago.

    a b d
1 1 1 1 1

d can't match anywhere; all the bits get set back to 1.

  a b d a
1 1 1 1 0

As before.

a b d a b
1 1 1 0 1

As before.

b d a b a
1 1 0 1 0

a is good if the match started either 2 characters ago or on the current character.

d a b a b
1 0 1 0 1

b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means that we've almost matched the whole pattern...

a b a b a
1 1 0 1 0

...but the next character is a, which is no good if the match started 4 characters ago. However, shorter matches might still be good.

b a b a b
1 0 1 0 1

Still looking good.

a b a b c
0 1 1 1 1

Finally, c is good if the match started 4 characters before. The fact that a 0 has made it all the way to the top bit means that we have a match.

like image 132
Matthew Slattery Avatar answered Sep 21 '22 07:09

Matthew Slattery


Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.

The concept essential for groking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:

Let's have STR, which is a String created with characters from a given alphabet.

Let's represent STR with a set of binary digits: STR_BINARY. The algorithm requires for this representation to be backwards (so, the first letter corresponds to the last digit, second letter with the second-to-last digit, etc.).

Let's assume RANDOM refers to a String with random characters from the same alphabet STR is created from.

In STR_BINARY, a 0 at a given index indicates that, RANDOM matches STR from STR[0] to

STR[(index of letter in STR that the 0 in STR_BINARY corresponds to)]. Empty spaces count as matches. A 1 indicates that RANDOM does not match STR in inside those same boundaries.

The algorithm becomes simpler to learn once this is understood.

like image 20
Kevin Avatar answered Sep 23 '22 07:09

Kevin