I have been trying to understand KMP algorithm. Still I didn't get the clear understanding of reasoning behind kmp algorithm. Suppose my text is bacbababaabcbab
and pattern is abababca
. By using the rule of length of longest proper prefix of sub(pattern)
that matches the proper suffix of sub(pattern)
, I filled my table[]
.
a b a b a b c a
0 0 1 2 3 4 0 1
Now I started applying KMP algorithm on the text with my pattern and table.
After coming to index 4 of above text, we have a match of length(l)=5;
by looking at table[l-1]=3;
As per KMP algorithm we can skip length up to 2 chars and can continue .
bacbababaabcbab
----xx|||
abababca
Here I am not getting the logic behind shifting. Why should we shift? Can somebody please clarify my confusion?
The KMP algorithm is a solution to the string search problem wherein we are required to find if a given pattern string occurs in another main string. It is one of the advanced string matching algorithm that was conceived by Donald Knuth, James H. Morris and Vaughan Pratt, hence the name "KMP algorithm".
The KMP algorithm was the first-ever string matching algorithm that ran in linear time. Most of the naive string matching algorithms run in O(nm) time, while the KMP algorithm runs in O(m + n) time where n is the length of the string, and m is the length of the pattern.
The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the pattern) of the pattern and improves the worst case complexity to O(n).
KMP spends a little time precomputing a table (on the order of the size of W[] , O(k)), and then it uses that table to do an efficient search of the string in O(n). The difference is that KMP makes use of previous match information that the straightforward algorithm does not.
To understand the logic behind the KMP algorithm , you should first understand , how this KMP algo is better than brute-force algorithm .
Idea
After a shift of the pattern, the naive algorithm has forgotten all information about previously matched symbols. So it is possible that it re-compares a text symbol with different pattern symbols again and again. This leads to its worst case complexity of Θ(nm) (n: length of the text, m: length of the pattern).
The algorithm of Knuth, Morris and Pratt [KMP 77] makes use of the information gained by previous symbol comparisons. It never re-compares a text symbol that has matched a pattern symbol. As a result, the complexity of the searching phase of the Knuth-Morris-Pratt algorithm is in O(n).
However, a preprocessing of the pattern is necessary in order to analyze its structure. The preprocessing phase has a complexity of O(m). Since m<=n, the overall complexity of the Knuth-Morris-Pratt algorithm is in O(n).
text :bacbababaabcbab pattern:abababca
In brute-force method , Slide the pattern over text one by one and check for a match. If a match is found, then slides by 1 again to check for subsequent matches .
void search(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
/* A loop to slide pat[] one by one */
for (int i = 0; i <= N - M; i++)
{
int j;
/* For current index i, check for pattern match */
for (j = 0; j < M; j++)
{
if (txt[i+j] != pat[j])
break;
}
if (j == M) // if pat[0...M-1] = txt[i, i+1, ...i+M-1]
{
printf("Pattern found at index %d \n", i);
}
}
}
The complexity of above algorithm is O(nm). In the above algorithm we never used comparison data we processed,
Bacbababaabcbab //let I be iterating variable over this text
Abababca //j be iterating variable over pattern
When i=0 and j=0 there is a mismatch (text[i+j]!=pat[j]), we increment i until there is a match . When i =4 , there is a match(text[i+j]==pat[j]), increment j till we find mismatch (if j= patternlength we found pattern) ,in the given example we find mismatch at j=5 when i=4 , a mismatch happens at idex 4+5=9 in text. The sub string that matched is ababa , **
Why we need to choose longest proper prefix which is also proper suffix :
**
From the above : we see that mismatch happened at 9 where pattern ended with substring ababa .
Now if we want to take advantage over the comparisons we have done so far then we can skip (increment) i more than 1 then the numbers of comparisons will be reduced leading to better time complexity.
Now what advantage we can take on processed comparison data “ababa” .
If we see carefully: the prefix aba of string ababa is compared with text and matched, same is the case with suffix aba. But there is a mismatch ‘b’ with ‘a’
Bacbababaabcbab
||||||
||||| x
|||| ||
ababab
But according to naïve approach, we increment i to 5. But we know by looking at it, we can set i =6 as next occurrence of aba occurs at 6. So instead of comparing with each and every element in text we preprocess the pattern for finding the longest proper prefix which is also proper suffix (which is called border). In the above example for ‘ababa’ ,and length of longest border is 3 (which is aba) . So increment by: length of substring – length of longest border => 5-3 =2.
If our comparison fails at aba , then length of longest border is 1 and j=3 so increment by 2 .
For more on how to preprocess : http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080 http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With