I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.
I understand the basic premise of the algorithm, which is (correct me if i'm wrong):
Two strings: PATTERN (the desired string)
TEXT (the String to be perused for the presence of PATTERN)
Two indices: i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
j (arbitrary index in TEXT)
Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1
In english terms, PATTERN.substring(0,i)
matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1)
was successfully matched and the character at PATTERN[i]
is the same as the character at TEXT[j]
.
What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:
Here is the bit-shifting version of the concept:
Here is T[text] for a sample search string:
And here is a trace of the algorithm.
Specifically, I don't understand what the T table signifies, and the reason behind OR
ing an entry in it with the current state.
I'd be grateful if anyone can help me understand what exactly is going on
The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal.
String matching algorithms have greatly influenced computer science and play an essential role in various real-world problems. It helps in performing time-efficient tasks in multiple domains. These algorithms are useful in the case of searching a string within another string.
T
is slightly confusing because you would normally number positions in the
pattern from left to right:
0 1 2 3 4
a b a b c
...whereas bits are normally numbered from right to left.
But writing the pattern backwards above the bits makes it clear:
bit: 4 3 2 1 0 c b a b a T[a] = 1 1 0 1 0 c b a b a T[b] = 1 0 1 0 1 c b a b a T[c] = 0 1 1 1 1 c b a b a T[d] = 1 1 1 1 1
Bit n of T[x]
is 0
if x
appears in position n, or 1
if it does not.
Equivalently, you can think of this as saying that if the current character
in the input string is x
, and you see a 0
in position n of T[x]
, then you
can only possibly be matching the pattern if the match started n characters
previously.
Now to the matching procedure. A 0
in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.
[start]
1 1 1 1 1
As we consume characters trying to match, the state is shifted left (which shifts a zero in
to the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a
; shifting left and OR-ing in T[a]
gives:
a
1 1 1 1 0
The 0
bit that was shifted in is preserved, because a current character of a
can
begin a match of the pattern. For any other character, the bit would be have been set to
1
.
The fact that bit 0 of the state is now 0
means that we started matching the pattern on
the current character; continuing, we get:
a b
1 1 1 0 1
...because the 0
bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b]
has a 0
in the same position, telling
us that a seeing a b
in the current position is good if we started matching 1 character
ago.
a b d
1 1 1 1 1
d
can't match anywhere; all the bits get set back to 1
.
a b d a
1 1 1 1 0
As before.
a b d a b
1 1 1 0 1
As before.
b d a b a
1 1 0 1 0
a
is good if the match started either 2 characters ago or on the current character.
d a b a b
1 0 1 0 1
b
is good if the match started either 1 or 3 characters ago. The 0
in bit 3 means
that we've almost matched the whole pattern...
a b a b a
1 1 0 1 0
...but the next character is a
, which is no good if the match started 4 characters
ago. However, shorter matches might still be good.
b a b a b
1 0 1 0 1
Still looking good.
a b a b c
0 1 1 1 1
Finally, c
is good if the match started 4 characters before. The fact that
a 0
has made it all the way to the top bit means that we have a match.
Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.
The concept essential for groking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:
Let's have STR
, which is a String created with characters from a given alphabet.
Let's represent STR
with a set of binary digits: STR_BINARY
. The algorithm requires for this representation to be backwards (so, the first letter corresponds to the last digit, second letter with the second-to-last digit, etc.).
Let's assume RANDOM
refers to a String with random characters from the same alphabet STR
is created from.
In STR_BINARY
, a 0 at a given index indicates that, RANDOM
matches STR
from STR[0]
to
STR[(index of letter in STR that the 0 in STR_BINARY corresponds to)]
. Empty spaces count as matches. A 1 indicates that RANDOM
does not match STR
in inside those same boundaries.
The algorithm becomes simpler to learn once this is understood.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With