I am having trouble understanding the Boyer-Moore string search algorithm. I am following this document: Link. I cannot work out what delta1 and delta2 actually mean here, or how they are applied in the string search. The language seemed a little vague. If anybody can help me understand this, it would be really helpful. Or, if you know of any other link or document that is easier to understand, please share it. Thanks in advance.

Boyer Moore Algorithm Understanding and Example?

2 Answers

The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.

Let's say our pattern p is the sequence of characters p1, p2, ..., pn and we are searching a string s, currently with p aligned so that pn is at index i in s.

E.g.:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^

The B-M paper makes the following observations:

(1) if we try matching a character that is not in p then we can jump forward n characters:

'F' is not in p, hence we advance n characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =        AT THAT
i =              ^

(2) if we try matching a character whose last position is k from the end of p then we can jump forward k characters:

The last position of ' ' (space) in p is 4 from the end, hence we advance 4 characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =            AT THAT
i =                  ^
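
Observations (1) and (2) can be captured in a single lookup table, which is roughly what the paper calls 'delta1'. The following is a minimal sketch, not the paper's own code; the names build_delta1 and shift_for are made up for illustration, and the table is stored as a dictionary rather than an array over the whole alphabet.

```python
def build_delta1(pattern):
    """Map each pattern character to its distance from the end of the pattern."""
    n = len(pattern)
    delta1 = {}
    for index, char in enumerate(pattern):
        # Later occurrences overwrite earlier ones, so the last position wins.
        delta1[char] = n - 1 - index
    return delta1


def shift_for(delta1, pattern, char):
    """Shift allowed when `char` is seen under the last pattern position."""
    # Characters absent from the pattern allow a full-length jump (observation 1).
    return delta1.get(char, len(pattern))


pattern = "AT THAT"
table = build_delta1(pattern)
print(shift_for(table, pattern, "F"))  # 'F' is not in the pattern: prints 7
print(shift_for(table, pattern, " "))  # last ' ' is 4 from the end: prints 4
```

These are the same two jumps (7, then 4) taken in the example above.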

Now we scan backwards from i until we either succeed or we hit a mismatch.

(3a) if the mismatch occurs k characters from the start of p and the mismatched character is not in p, then we can advance (at least) k characters.

'L' is not in p and the mismatch occurred against p6, hence we can advance (at least) 6 characters:

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                  AT THAT
i =                        ^

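However, we can actually do better than this.

(3b) we know that at the old i we had already matched some characters (one in this case, the final 'T'). If the matched characters don't match the start of p, then we can jump forward a little more. Here, advancing by only 6 would put p1 = 'A' under the 'T' we already matched, so we advance by 7 instead. (Shifts derived from the already-matched end of the pattern are what the paper tabulates as 'delta2'.)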

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                   AT THAT
i =                         ^
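
For reference, here is a brute-force sketch of one common formulation of the delta2 table, usually called the "strong good-suffix" rule. Treat it as an assumption about what delta2 means: the name build_delta2 and the 0-based indexing are made up for illustration, and real implementations build the table much faster. Because the table depends only on the pattern, its entries need not equal the shift reasoned out by hand above, which also uses the fact that 'L' is not in p.

```python
def build_delta2(pattern):
    """Brute-force good-suffix shifts, indexed by 0-based mismatch position."""
    n = len(pattern)
    delta2 = []
    for j in range(n):
        for shift in range(1, n + 1):
            # The matched suffix pattern[j+1:] must agree with whatever part
            # of the shifted pattern still overlaps it.
            suffix_ok = all(
                k - shift < 0 or pattern[k - shift] == pattern[k]
                for k in range(j + 1, n)
            )
            # The character moved under the mismatch must differ from the one
            # that just failed, otherwise the same mismatch would repeat.
            differs = j - shift < 0 or pattern[j - shift] != pattern[j]
            if suffix_ok and differs:
                delta2.append(shift)
                break  # shift == n always qualifies, so this is always reached
    return delta2


print(build_delta2("AT THAT"))
```

For "AT THAT", a mismatch against 'H' after matching "AT" gives a shift of 5, because "AT" occurs again at the very start of the pattern.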

At this point, observation (2) applies again, giving

s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                       AT THAT
i =                             ^

and bingo! We're done.
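
Putting the pieces together, a complete search might look like the sketch below. It is written under assumptions rather than taken from the paper: boyer_moore_search is a made-up name, the bad-character shift is computed from a last-occurrence dictionary, and the good-suffix table is built by brute force for clarity. At each mismatch it takes the larger of the two shifts, so the intermediate alignments can differ from the hand-worked steps above even though it finds the same match.

```python
def boyer_moore_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n = len(pattern)
    if n == 0:
        return 0
    # Last index at which each character occurs in the pattern.
    last = {char: index for index, char in enumerate(pattern)}

    def good_suffix_shift(j):
        # Smallest shift that keeps the matched suffix pattern[j+1:] consistent
        # and puts a different character under the mismatch position j.
        for shift in range(1, n + 1):
            suffix_ok = all(
                k < shift or pattern[k - shift] == pattern[k]
                for k in range(j + 1, n)
            )
            if suffix_ok and (j < shift or pattern[j - shift] != pattern[j]):
                return shift
        return n

    delta2 = [good_suffix_shift(j) for j in range(n)]

    start = 0  # index in text where the pattern currently begins
    while start + n <= len(text):
        j = n - 1
        # Compare right to left, starting from the last pattern character.
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            return start
        # Bad-character shift; it can be zero or negative, but delta2 is >= 1.
        bad_char = j - last.get(text[start + j], -1)
        start += max(bad_char, delta2[j])
    return -1


text = "WHICH FINALLY HALTS.  AT THAT POINT..."
print(boyer_moore_search(text, "AT THAT"))  # prints 22
```

The brute-force table costs extra preprocessing time, which is fine for a short pattern like this one but is not how a production implementation would do it.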

answered Oct 07 '22 23:10

Rafe

The algorithm is based on a simple principle. Suppose that I'm trying to match a substring of length m. I'm going to first look at the character at index m of the string being searched. If that character does not occur anywhere in my substring, I know that the substring I want can't start at any of the indices 1, 2, ... , m.

If that character does occur in my substring, I'll assume that it is at the last place in my substring that it can be. I'll then jump back and start trying to match my substring from that possible starting place. This piece of information is my first table.

Once I start matching from the beginning of the substring, when I find a mismatch, I can't just start from scratch. I could be partially through a match starting at a different point. For instance, if I'm trying to match anand in ananand, I successfully match anan, then realize that the following a is not a d. But I've just matched an, so I should jump back to trying to match the third character of my substring. This "if I fail after matching x characters, I could be on the y'th character of a match" information is stored in the second table.
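
One way to read this second table, since the matching here runs left to right, is as the failure (prefix) function used in Knuth-Morris-Pratt. That reading is an assumption, and build_fallback_table is a made-up name; the sketch below only shows how the anand example works out.

```python
def build_fallback_table(pattern):
    """fallback[x] = characters still matched after a failure with x matched."""
    fallback = [0] * (len(pattern) + 1)
    matched = 0
    for x in range(2, len(pattern) + 1):
        # Shrink the candidate prefix until it can be extended by pattern[x - 1].
        while matched > 0 and pattern[x - 1] != pattern[matched]:
            matched = fallback[matched]
        if pattern[x - 1] == pattern[matched]:
            matched += 1
        fallback[x] = matched
    return fallback


table = build_fallback_table("anand")
print(table[4])  # after matching "anan", "an" is still matched: prints 2
```

With 2 characters still matched, the next comparison is against the third character of the substring, as described above.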

Note that when I fail to match, the second table knows how far along in a match I might be, based on what I just matched. The first table knows how far back I might be, based on the character that I just saw and failed to match. You want to use the more pessimistic of those two pieces of information.

With this in mind the algorithm works like this:

start at beginning of string
start at beginning of match
while not at the end of the string:
    if match_position is 0:
        Jump ahead m characters
        Look at character, jump back based on table 1
        If match the first character:
            advance match position
        advance string position
    else if I match:
        if I reached the end of the match:
           FOUND MATCH - return
        else:
           advance string position and match position.
    else:
        pos1 = table1[ character I failed to match ]
        pos2 = table2[ how far into the match I am ]
        if pos1 < pos2:
            jump back pos1 in string
            set match position at beginning
        else:
            set match position to pos2
FAILED TO MATCH

answered Oct 07 '22 22:10

btilly

Tags:

algorithm

string-search

AGeek
