String similarity: how exactly does Bitap work?

Tags:

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.

I understand the basic premise of the algorithm, which is (correct me if i'm wrong):

Two strings:     PATTERN (the desired string)
                 TEXT (the String to be perused for the presence of PATTERN)

Two indices:     i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
                 j (arbitrary index in TEXT)

Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1

In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].

What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:

Here is the bit-shifting version of the concept:

enter image description here

Here is T[text] for a sample search string:

enter image description here

And here is a trace of the algorithm.

enter image description here

Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.

I'd be grateful if anyone can help me understand what exactly is going on

381

asked Jul 03 '12 19:07

Kevin

2 Answers

T is slightly confusing because you would normally number positions in the pattern from left to right:

0 1 2 3 4
a b a b c

...whereas bits are normally numbered from right to left.

But writing the pattern backwards above the bits makes it clear:

  bit: 4 3 2 1 0

       c b a b a
T[a] = 1 1 0 1 0

       c b a b a
T[b] = 1 0 1 0 1

       c b a b a
T[c] = 0 1 1 1 1

       c b a b a
T[d] = 1 1 1 1 1

Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.

Equivalently, you can think of this as saying that if the current character in the input string is x, and you see a 0 in position n of T[x], then you can only possibly be matching the pattern if the match started n characters previously.

Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.

  [start]
1 1 1 1 1

As we consume characters trying to match, the state is shifted left (which shifts a zero in to the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:

        a
1 1 1 1 0

The 0 bit that was shifted in is preserved, because a current character of a can begin a match of the pattern. For any other character, the bit would be have been set to 1.

The fact that bit 0 of the state is now 0 means that we started matching the pattern on the current character; continuing, we get:

      a b
1 1 1 0 1

...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling us that a seeing a b in the current position is good if we started matching 1 character ago.

    a b d
1 1 1 1 1

d can't match anywhere; all the bits get set back to 1.

  a b d a
1 1 1 1 0

As before.

a b d a b
1 1 1 0 1

As before.

b d a b a
1 1 0 1 0

a is good if the match started either 2 characters ago or on the current character.

d a b a b
1 0 1 0 1

b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means that we've almost matched the whole pattern...

a b a b a
1 1 0 1 0

...but the next character is a, which is no good if the match started 4 characters ago. However, shorter matches might still be good.

b a b a b
1 0 1 0 1

Still looking good.

a b a b c
0 1 1 1 1

Finally, c is good if the match started 4 characters before. The fact that a 0 has made it all the way to the top bit means that we have a match.

132

answered Sep 21 '22 07:09

Matthew Slattery

Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.

The concept essential for groking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:

Let's have STR, which is a String created with characters from a given alphabet.

Let's represent STR with a set of binary digits: STR_BINARY. The algorithm requires for this representation to be backwards (so, the first letter corresponds to the last digit, second letter with the second-to-last digit, etc.).

Let's assume RANDOM refers to a String with random characters from the same alphabet STR is created from.

In STR_BINARY, a 0 at a given index indicates that, RANDOM matches STR from STR[0] to

STR[(index of letter in STR that the 0 in STR_BINARY corresponds to)]. Empty spaces count as matches. A 1 indicates that RANDOM does not match STR in inside those same boundaries.

The algorithm becomes simpler to learn once this is understood.

answered Sep 23 '22 07:09

Kevin

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

String similarity: how exactly does Bitap work?

Tags:

Kevin

People also ask

2 Answers

Matthew Slattery

Kevin

Recent Activity

Donate For Us

String similarity: how exactly does Bitap work?

Tags:

Kevin

People also ask

2 Answers

Matthew Slattery

Kevin

Related questions

Recent Activity

Donate For Us