I have been trying to understand the shift rules in the Boyer–Moore string search algorithm, but I haven't managed to. I read about them on Wikipedia, but the explanation there is too complex. It would be a great help if someone could list the rules in a simple manner.

What are the shift rules for Boyer–Moore string search algorithm?

2 Answers

In the Boyer-Moore algorithm, you start comparing pattern characters to text characters from the end of the pattern. If you find a mismatch, you have a configuration of the type

....xyzabc....      <-text
  ....uabc          <- pattern
      ^
    mismatch

Now the bad character shift means to shift the pattern so that the text character of the mismatch is aligned to the last occurrence of that character in the initial part of the pattern (pattern minus last pattern character), if there is such an occurrence, or one position before the pattern if the mismatched character doesn't appear in the initial part of the pattern at all.

That could be a shift to the left, if the situation is

     v
...xyzazc...
 ....uazc
 ..uazc

so the bad character shift alone doesn't guarantee progress.
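As a rough illustration of the bad character shift, here is a minimal Python sketch. The helper names `last_occurrences` and `bad_character_shift` are made up for this example, and `uazc` is assumed to be the whole pattern (the leading dots of the diagram are dropped).

```python
def last_occurrences(pattern):
    """Map each character to its last index in the pattern minus its final character."""
    return {ch: i for i, ch in enumerate(pattern[:-1])}


def bad_character_shift(pattern, j, text_char):
    """Shift suggested when pattern[j] mismatches text_char.

    A result <= 0 means the rule would move the pattern left (or not at all).
    """
    last = last_occurrences(pattern)
    # -1 stands for "one position before the pattern".
    return j - last.get(text_char, -1)


# Pattern "uazc": 'u' (index 0) mismatches the text character 'z'.
print(bad_character_shift("uazc", 0, "z"))     # -2: two positions to the left
# Pattern "EXAMPLE": 'A' (index 2) mismatches 'I', which is not in the pattern.
print(bad_character_shift("EXAMPLE", 2, "I"))  # 3
```

A real implementation would build the table once per pattern rather than on every call; it is rebuilt here only to keep the sketch short.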

The other shift, the good suffix shift, aligns the matched part of the text, m, with the rightmost other occurrence of that character sequence in the pattern, provided that occurrence is preceded by a different character than the matched suffix m of the pattern is (or by no character at all, if the matched suffix is also a prefix of the pattern), if there is such an occurrence.

So for example

           v
....abcdabceabcfabc...
    ...xabcfabcfabc
        ...xabcfabcfabc

would lead to a good suffix shift of four positions, since the matched part m = abcfabc occurs in the pattern four places left of its suffix-occurrence and is preceded by a different character there (x instead of f) than in the suffix position.

If there is no complete occurrence of the matched part in the pattern preceded by a different character than the suffix, the good suffix shift aligns a suffix of the matched part of the text with a prefix of the pattern, choosing maximal overlap, e.g.

      v
...robocab....
    abacab
        abacab

The good suffix shift always moves the pattern to the right, so it guarantees progress.
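The good suffix rule can be checked by brute force. The sketch below is only an illustration, not how production implementations do it (they precompute a table in linear time); the function name `good_suffix_shift` is an assumption of this example. It tries shifts of 1, 2, ... and returns the first one where every matched character lands on an equal pattern character (or falls off the left end of the pattern) and the character arriving at the mismatch position differs from the one that just failed.

```python
def good_suffix_shift(pattern, j):
    """Smallest right shift consistent with the matched suffix pattern[j+1:].

    j is the index of the mismatched pattern character.
    """
    m = len(pattern)
    for s in range(1, m + 1):
        # Every matched character must land on an equal character,
        # or fall off the left end of the shifted pattern.
        suffix_ok = all(k - s < 0 or pattern[k - s] == pattern[k]
                        for k in range(j + 1, m))
        # The character moving under the mismatch position must differ
        # from the one that just failed (or not exist).
        differs = j - s < 0 or pattern[j - s] != pattern[j]
        if suffix_ok and differs:
            return s
    return m  # not reached for a non-empty pattern: s == m always qualifies


# The two diagrams above, with the leading dots of the first pattern dropped.
print(good_suffix_shift("xabcfabcfabc", 4))  # 4
print(good_suffix_shift("abacab", 2))        # 4
```

Because the loop starts at a shift of 1, the result is always positive, which is the progress guarantee mentioned above.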

Then, on every mismatch, the advances of the bad character shift and the good suffix shift are compared, and the greater one is chosen. Christian Charras and Thierry Lecroq explain it in greater depth here, along with many other string searching algorithms.

For the example you mentioned in the comments,

SSIMPLE EXAMPLE
EXAMPLE
  ^

the matched suffix is MPLE, and the mismatched text character is I. So the bad character shift looks for the last occurrence of I in the initial part of the pattern. There is none, so the bad character shift would move the pattern so that the mismatched I is one position before the start of the pattern

SSIMPLE EXAMPLE
   EXAMPLE

and the good suffix shift looks for the rightmost occurrence of MPLE in the pattern not preceded by an A, or the longest suffix of MPLE that is a prefix of the pattern. There is no complete occurrence of the matched part in the pattern before the suffix, so the longest suffix of the matched part that is also a prefix of the pattern determines the good suffix shift. In this case, the two suffixes of the matched part that are prefixes of the pattern are the single-character string E, and the empty string. The longest is obviously the nonempty string, so the good suffix shift aligns the one-character suffix E in the matched part of the text with the one-character prefix of the pattern

SSIMPLE EXAMPLE
      EXAMPLE

The good suffix shift shifts the pattern farther right, so that is the chosen shift.

Then there is an immediate mismatch at the last pattern position, and then the bad character shift aligns the P in the text with the P in the pattern (and the good suffix shift need not be considered at all if the mismatch occurs at the last pattern character, since in that case, it would never produce a larger shift than the bad character shift).

Then we have the complete match.

In the variant with the pattern TXAMPLE, the good suffix shift finds that no non-empty suffix of the matched part is a prefix of the pattern (and there is no occurrence of the complete matched part in the pattern not preceded by A), so the good suffix shift aligns the empty suffix of the matched part of the text (the boundary between the E and the space) with the empty prefix of the pattern (the empty string preceding the T), resulting in

SSIMPLE EXAMPLE
       TXAMPLE

(then in the next step, the bad character shift aligns the two Ls, and the next mismatch in the step thereafter occurs at the initial T of the pattern).
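The walkthrough above can be replayed with a short Python sketch that records every alignment tried. It is an unoptimized illustration under stated assumptions: the names `good_suffix_shift` and `boyer_moore_trace` are invented here, the good suffix shift is found by brute force instead of a precomputed table, and after a full match the pattern is moved by the same rule with the whole pattern treated as the matched part.

```python
def good_suffix_shift(pattern, j):
    """Smallest right shift consistent with the matched suffix pattern[j+1:]."""
    m = len(pattern)
    for s in range(1, m + 1):
        suffix_ok = all(k - s < 0 or pattern[k - s] == pattern[k]
                        for k in range(j + 1, m))
        differs = j - s < 0 or pattern[j - s] != pattern[j]
        if suffix_ok and differs:
            return s
    return m


def boyer_moore_trace(text, pattern):
    """Return (alignments tried, match positions) for a non-empty pattern."""
    m, n = len(pattern), len(text)
    if m == 0:
        return [], []  # assumption: an empty pattern is simply not searched
    # Last occurrence of each character in the pattern minus its last character.
    last = {ch: i for i, ch in enumerate(pattern[:-1])}
    tried, matches = [], []
    pos = 0
    while pos + m <= n:
        tried.append(pos)
        j = m - 1
        while j >= 0 and pattern[j] == text[pos + j]:
            j -= 1  # compare from right to left
        if j < 0:
            matches.append(pos)
            pos += good_suffix_shift(pattern, -1)  # whole pattern matched
        else:
            bad = j - last.get(text[pos + j], -1)
            good = good_suffix_shift(pattern, j)
            pos += max(bad, good)  # take the larger of the two shifts
    return tried, matches


print(boyer_moore_trace("SSIMPLE EXAMPLE", "EXAMPLE"))  # ([0, 6, 8], [8])
print(boyer_moore_trace("SSIMPLE EXAMPLE", "TXAMPLE"))  # ([0, 7, 8], [])
```

The first list in each result gives the pattern offsets shown in the diagrams: 0, then 6, then the match at 8 for EXAMPLE, and 0, 7, 8 with no match for TXAMPLE.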

answered Nov 07 '22 02:11

Daniel Fischer

There's a good visualization here.

(EDIT: There's also a very good explanation with both examples and an example of how to implement the preprocessing steps here.)

General rules:

We're looking for how to align the pattern with the text so that the aligned parts match. If no such alignment exists, the pattern isn't found in the text.
Check each alignment from right to left - that is, start by checking if the last character of the pattern matches its current alignment.
When you hit a character that doesn't align, increase the offset (shift the pattern) so that the last occurrence of the text-side letter in the pattern is aligned with this occurrence of the text-side letter we're currently looking at. This requires pre-building (or searching each time, but that's less efficient) an index of where each letter exists in the pattern.
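If the character being considered in the text doesn't appear in the pattern at all, shift the pattern so that it starts just past that character. When the mismatch is at the last character of the pattern, that is a skip of the full length of the pattern.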
If the end of the pattern sticks out past the end of the text (offset + length(pattern) > length(text)), the pattern doesn't appear in the text.

What I've just described is the "bad character" rule. The "good suffix" rule gives another option for shifting; whichever shifts farther is the one you should take. It's entirely possible to implement the algorithm without the good suffix rule, but it will be less efficient once the indices are built up.
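A search that uses only the bad character rule can be sketched as follows. This is a minimal illustration, not a tuned implementation; the function name `find_all_bad_char_only` is made up, and the `max(1, ...)` guard is an addition needed because this rule on its own can suggest a zero or leftward shift.

```python
def find_all_bad_char_only(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []  # assumption: an empty pattern is treated as "no matches"
    # Index of the last occurrence of each character in the whole pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    matches = []
    offset = 0
    while offset + m <= n:  # stop once the pattern sticks out past the text
        j = m - 1
        while j >= 0 and pattern[j] == text[offset + j]:
            j -= 1  # check the alignment from right to left
        if j < 0:
            matches.append(offset)
            offset += 1
        else:
            # Align the last occurrence of the text character with it; if the
            # character is absent, the pattern moves just past it. The max()
            # guards against a zero or leftward shift.
            offset += max(1, j - last.get(text[offset + j], -1))
    return matches


print(find_all_bad_char_only("SSIMPLE EXAMPLE", "EXAMPLE"))  # [8]
```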

The good-suffix rule requires that you also know where each suffix of the pattern occurs again elsewhere in the pattern. When you hit a mismatch (checking, as always, from right to left), the good-suffix shift moves the pattern to a point where the letters that did already match will do so again. Alternatively, if the part that matched occurs nowhere else in the pattern, you can skip all the way past it (or, if the pattern begins with some suffix of the matched part, only far enough to line that prefix up with it), because if it didn't match when lined up with the sole occurrence, it can't possibly match when lined up with any other part of the pattern.

For example, let's consider the following situation:

My pattern ends in "a dog".
I currently have it aligned with a part of the text that ends in "s dog".
Therefore, the bad letter is 's' (where they stop matching), and the good suffix is " dog" (the part that did match).

I have two options here:

Shift so that the first 's' (from right to left) in the pattern is aligned with the 's' in the text. If there is no 's' in the pattern, shift the beginning of the pattern to just past the 's'.
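Shift so that the next " dog" (from right to left) in the pattern is aligned with the " dog" in the text. If there isn't another " dog" in the pattern, shift the beginning of the pattern to just past the end of " dog", unless the pattern starts with a suffix of " dog" (such as "g"), in which case align that prefix with the end of " dog" instead.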

and I should take whichever one lets me shift farther.

If you're still confused, try asking a more specific question; it's hard to be clear when we don't know where you're stuck.

answered Nov 07 '22 02:11

jfmatt

Tags: algorithm, string-search, boyer-moore

saplingPro
