
Reasoning behind shifting over the text when a mismatch occurs in the KMP algorithm?

Tags:

string

algorithm

pattern-matching

I have been trying to understand the KMP algorithm, but I still don't have a clear understanding of the reasoning behind it. Suppose my text is bacbababaabcbab and my pattern is abababca. Using the rule "length of the longest proper prefix of sub(pattern) that is also a proper suffix of sub(pattern)", I filled in my table[]:

a b a b a b c a
0 0 1 2 3 4 0 1

Then I started applying the KMP algorithm to the text with my pattern and table.

Starting at index 4 of the text above, we have a match of length l = 5. Looking up table[l-1] = 3, the KMP algorithm says we can skip ahead by 2 characters and continue:

bacbababaabcbab
----xx|||
      abababca

What I don't understand is the logic behind this shift. Why should we shift? Can somebody please clarify my confusion?


asked Sep 14 '13 13:09

Riding Cave

1 Answer

To understand the logic behind the KMP algorithm, you should first understand how it improves on the brute-force algorithm.

Idea

After a shift of the pattern, the naive algorithm has forgotten all information about previously matched symbols. So it is possible that it re-compares a text symbol with different pattern symbols again and again. This leads to its worst case complexity of Θ(nm) (n: length of the text, m: length of the pattern).

The algorithm of Knuth, Morris and Pratt [KMP 77] makes use of the information gained by previous symbol comparisons. It never re-compares a text symbol that has matched a pattern symbol. As a result, the complexity of the searching phase of the Knuth-Morris-Pratt algorithm is in O(n).

However, a preprocessing of the pattern is necessary in order to analyze its structure. The preprocessing phase has a complexity of O(m). Since m<=n, the overall complexity of the Knuth-Morris-Pratt algorithm is in O(n).
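The two phases described above can be sketched roughly as follows. This is only an illustrative sketch, not code from the cited source: the names `build_table` and `kmp_search` are made up here, and the table is assumed to follow the same convention as the question's table[] (table[k] is the border length of pat[0..k]).

```c
#include <stdlib.h>
#include <string.h>

/* Preprocessing phase, O(m): table[k] is the length of the longest proper
   prefix of pat[0..k] that is also a suffix of pat[0..k]. */
static void build_table(const char *pat, size_t m, size_t *table)
{
    size_t len = 0;                  /* border length carried so far */
    if (m == 0)
        return;
    table[0] = 0;
    for (size_t k = 1; k < m; k++) {
        while (len > 0 && pat[k] != pat[len])
            len = table[len - 1];    /* fall back to the next shorter border */
        if (pat[k] == pat[len])
            len++;
        table[k] = len;
    }
}

/* Searching phase, O(n): returns the index of the first occurrence of pat
   in txt, or -1 if there is none (also -1 if the allocation fails). */
long kmp_search(const char *pat, const char *txt)
{
    size_t m = strlen(pat), n = strlen(txt);
    if (m == 0)
        return 0;
    size_t *table = malloc(m * sizeof *table);
    if (table == NULL)
        return -1;
    build_table(pat, m, table);

    long found = -1;
    size_t j = 0;                    /* pattern characters matched so far */
    for (size_t i = 0; i < n; i++) { /* the text index never moves backwards */
        while (j > 0 && txt[i] != pat[j])
            j = table[j - 1];        /* shift the pattern, keep the border */
        if (txt[i] == pat[j])
            j++;
        if (j == m) {
            found = (long)(i + 1 - m);
            break;
        }
    }
    free(table);
    return found;
}
```

With the question's inputs, kmp_search("abababca", "bacbababaabcbab") returns -1, because the pattern does not occur in that text.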

text:    bacbababaabcbab
pattern: abababca

In the brute-force method, slide the pattern over the text one position at a time and check for a match at each position. If a match is found, slide by 1 again to check for subsequent matches.

#include <stdio.h>
#include <string.h>

void search(const char *pat, const char *txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    /* A loop to slide pat[] one by one */
    for (int i = 0; i <= N - M; i++)
    {
        int j;

        /* For current index i, check for pattern match */
        for (j = 0; j < M; j++)
        {
            if (txt[i+j] != pat[j])
                break;
        }
        if (j == M)  // if pat[0...M-1] = txt[i, i+1, ...i+M-1]
        {
           printf("Pattern found at index %d \n", i);
        }
    }
}

The complexity of the above algorithm is O(nm). Note that it never reuses the information gained from the comparisons it has already made. Consider:

bacbababaabcbab   // let i be the index iterating over the text

abababca          // let j be the index iterating over the pattern

When i=0 and j=0 there is a mismatch (txt[i+j] != pat[j]), so we increment i. At i=1 the first character matches but the second does not, and at i=2 and i=3 the first character already fails. When i=4 there is a match (txt[i+j] == pat[j]), so we increment j until we find a mismatch (if j reaches the pattern length, we have found the pattern). In the given example the mismatch comes at j=5 when i=4, that is, at index 4+5=9 in the text. The substring that matched is ababa.
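To check this trace by hand, the inner loop of the brute-force search can be pulled out into a small helper. This is a sketch only; `match_length` is a hypothetical name, and it assumes i is not past the end of the text.

```c
#include <stddef.h>

/* Number of leading pattern characters that match the text when the
   pattern is aligned at text index i. Assumes i <= strlen(txt). */
size_t match_length(const char *txt, const char *pat, size_t i)
{
    size_t j = 0;
    /* Stop at the end of either string or at the first mismatch. */
    while (pat[j] != '\0' && txt[i + j] != '\0' && txt[i + j] == pat[j])
        j++;
    return j;
}
```

For the example text and pattern, the helper reports 0 matched characters at i=0, 1 at i=1, and 5 at i=4, which puts the mismatch at text index 9.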

Why do we need to choose the longest proper prefix that is also a proper suffix?

From the above, we see that the mismatch happened at index 9, after the pattern had matched the substring ababa. If we want to take advantage of the comparisons done so far, we can advance i by more than 1; this reduces the number of comparisons and leads to a better time complexity.
What advantage can we take from the already compared data “ababa”? Look carefully: the prefix aba of ababa has been compared with the text and matched, and the same is true of the suffix aba. Then comes the mismatch: pattern character ‘b’ against text character ‘a’.

bacbababaabcbab
    |||||x
    abababca

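The naïve approach now increments i to 5. But by looking at the text we can see that i can be set to 6, because the next occurrence of aba starts at index 6. Moving to i=5 cannot succeed: it would require baba (text positions 5 to 8) to equal the pattern prefix abab. So instead of comparing against each and every position in the text, we preprocess the pattern to find, for each of its prefixes, the longest proper prefix that is also a proper suffix (this is called a border). In the above example, for ‘ababa’ the longest border has length 3 (aba). So we increment i by: length of matched substring minus length of longest border = 5-3 = 2. The first 3 characters of the pattern are then already known to match, so the comparison resumes at text index 9 with j=3.
If instead the comparison fails right after matching aba, the longest border has length 1 (a) and j=3, so we again increment by 3-1 = 2.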
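The shift rule can be checked directly against the definition of a border. The sketch below deliberately uses a slow, definition-based search for the border (quadratic in the prefix length) rather than the usual linear-time preprocessing. The names `longest_border` and `shift_after_mismatch` are hypothetical, and j is assumed to be at most the pattern length.

```c
#include <string.h>

/* Length of the longest proper prefix of s[0..len-1] that is also a
   suffix of it, found by trying every candidate length, longest first. */
size_t longest_border(const char *s, size_t len)
{
    if (len == 0)
        return 0;
    for (size_t b = len - 1; b > 0; b--) {
        /* Compare the first b characters with the last b characters. */
        if (memcmp(s, s + len - b, b) == 0)
            return b;
    }
    return 0;
}

/* How far the pattern moves after j characters matched and the next
   comparison failed: matched length minus longest border length. */
size_t shift_after_mismatch(const char *pat, size_t j)
{
    if (j == 0)
        return 1;                /* nothing matched: plain shift by one */
    return j - longest_border(pat, j);
}
```

For the pattern abababca, shift_after_mismatch gives 2 for j=5 (border aba) and 2 for j=3 (border a), matching the two cases worked out above.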

For more on how to preprocess:
http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm


answered Nov 15 '22 11:11

Imposter
