Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding Knuth-Morris-Pratt Algorithm

Can someone explain this to me? I've been reading about it and it still is hard to follow.

text : ababdbaababa
pattern: ababa

table for ababa is -1 0 0 1 2.

I think I understand how the table is constructed but, I dont understand how to shift once mismatch has occurred. Seems like we dont even use the table when shifting?

when do we use the table?

like image 549
antz Avatar asked Nov 07 '12 14:11

antz


3 Answers

Here I have briefly described computing the prefix function and shifting through the text here.

enter image description here

For further information: Knuth–Morris–Pratt string search algorithm

Shifting through the text :

Text:     ABC ABCDAB ABCDABCDABDE
Pattern : ABCDABD

Scenario 1 - There is/are some matching character/s in Pattern and Text.
e.g 1: In here there are 3 matching characters.

enter image description here

Get the value from table for 3 characters. (index 2, ABC) i.e 0 Therefore shift = 3 - 0 i.e 3

enter image description here

e.g 2: In here there are 6 matching characters.

enter image description here

Get the value from table for 6 characters. (index 5, ABCDAB) i.e 2 Therefore shift = 6 - 2 i.e 4

enter image description here

Scenario 2 - If there is no matching characters then shift by one.

like image 93
Rukmal Dias Avatar answered Nov 16 '22 23:11

Rukmal Dias


the table is used when your mismatch occurs. Let's apply the pattern to your text:

You start matching text with pattern and test if your pattern could be in text, starting at the first position. You compare text[1] with pattern[1] and that turns out to be a match. You do the same for text[2], text[3] and text[4].

when you want to match text[5] with pattern[5] you don't have a match (d<>a). You then know that your pattern will not start at the first position. You could then start the matching all over again for position 2 but that is not efficient. You can use the table now.

The error occured at pattern[5] so you go to table[5] which is 2. That tells you that you can start matching at the current position again with 2 already matched characters. Instead of having to start matching position 2, you can start at your previous position (1) + table[5] (2)=3. Indeed, If we look at text[3] and text[4], we see that it is equal to pattern[1] and pattern[2], respectivily.

The numbers in table tell you how many positions are already matched when an error occurs. In this case 2 characters of the next pattern were already matched. You can then immediately start matching for position 3 and skip position 2 (as the pattern can not be found starting at position[2]).

like image 24
Origin Avatar answered Nov 16 '22 23:11

Origin


Well this is an old topic but hopefully someone who searches for this in the future will see it. Answer given above is good but I worked through an example myself to see what's going on exactly.

First part of the exposition is taken from wiki, the part I really wanted to elaborate on is how this backtracking array is constructed.

Here goes:

we work through a (relatively artificial) run of the algorithm, where

W = "ABCDABD" and 
S = "ABC ABCDAB ABCDABCDABDE". 

At any given time, the algorithm is in a state determined by two integers:

m which denotes the position within S which is the beginning of a prospective match for W

i the index in W denoting the character currently under consideration.

In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like

              1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456

We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step, we get S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting m = 4 and i = 0.

              1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456

We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus, not only do we omit previously matched characters of S, but also previously matched characters of W.

              1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456

This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin searching at the next character of S: m = 11, reset i = 0.

              1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456

Once again we immediately hit upon a match "ABCDAB" but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before, we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.

              1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456

This time we are able to complete the match, whose first character is S[15].

The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table T, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that if we have a match starting at S[m] that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that is, T[i] is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, T[0] = -1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at index m + i - T[i], as in the example above, we need not actually check any of the T[i] characters after that, so that we continue searching from W[T[i]].

BACKTRACKING ARRAY CONSTRUCTION:

so this backtracking array T[] we will call lps[], let's see how we calculate this guy

lps[i] = the longest proper prefix of pat[0..i] 
            which is also a suffix of pat[0..i].

Examples: For the pattern “AABAACAABAA”,

lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

//so just going through this real quick

 lps[0] is just 0 by default
 lps[1] is 1 because it's looking at AA and A is both a prefix and suffix
 lps[2] is 0 because it's looking at AAB and suffix is B but there is no prefix equal to B unless you count B itself which I guess is against the rules
 lps[3] is 1 because it's looking at AABA and first A matches last A
 lps[4] is 2 becuase it's looking at AABAA and first 2 A matches last 2 A
 lps[5] is 0 becuase it's looking at AABAAC and nothing matches C
 ...


 For the pattern “ABCDE”, lps[] is [0, 0, 0, 0, 0]
 For the pattern “AAAAA”, lps[] is [0, 1, 2, 3, 4]
 For the pattern “AAABAAA”, lps[] is [0, 1, 2, 0, 1, 2, 3]
 For the pattern “AAACAAAAAC”, lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]

And this totally makes sense if you think about it...if you mismatch, you want to go back as far as you can obviously, how far back you go (the suffix portion) is essentially the prefix since you must start matching from the first character again by definition. so if your string looks like

aaaaaaaaaaaaaaa..b..aaaaaaaaaaaaaaac and you mismatche on the last char c, then you want to reuse aaaaaaaaaaaaaaa as your new head, just think it through

like image 10
tweaking Avatar answered Nov 17 '22 01:11

tweaking