Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String Matching: Computing the longest prefix suffix array in kmp algorithm

KMP algorithm for string matching. Following is the code I found online for computing the longest prefix-suffix array:
Defination:

lps[i] = the longest proper prefix of pat[0..i] 
              which is also a suffix of pat[0..i]. 

Code:

void computeLPSArray(char *pat, int M, int *lps)
{
    int len = 0;  // length of the previous longest prefix suffix
    int i;

    lps[0] = 0; // lps[0] is always 0
    i = 1;

    // the loop calculates lps[i] for i = 1 to M-1
    while(i < M)
    {
       if(pat[i] == pat[len])
       {
         len++;
         lps[i] = len;
         i++;
       }
       else // (pat[i] != pat[len])
       {
         if( len != 0 )
         {
           // This is tricky. Consider the example AAACAAAA and i = 7.
           len = lps[len-1]; //*****************

           // Also, note that we do not increment i here
         }
         else // if (len == 0)
         {
           lps[i] = 0;
           i++;
         }
       }
    }
}

Can I use len = len-1 instead of len = lps[len-1]?
because len always counts the prefix length like from [0 .. someIndex]. Then why use lps for assignment here? Following are the cases for which I tested which work fine(first line is the pattern and subsequent two lines are the result for original and modified assignment to len ) :

a  a  a  b  a  b  c  
0  1  2  0  1  0  0  
0  1  2  0  1  0  0 

a  b  c  b  a  b  c  
0  0  0  0  1  2  3  
0  0  0  0  1  2  3  

a  a  b  c  b  a  b  
0  1  0  0  0  1  0  
0  1  0  0  0  1  0  

Code here with both variations written : http://ideone.com/qiSrUo

like image 856
Sahil Sareen Avatar asked Mar 24 '14 17:03

Sahil Sareen


People also ask

What is prefix and suffix in KMP algorithm?

The main component of the KMP algorithm is the prefix function. It is defined as for a string s of length n, π[i] is the length of the longest proper prefix of the substring s[0…i], which is also a suffix of this substring where the proper prefix of a string is a prefix that is not equal to the string itself.

What is KMP string matching algorithm?

Knuth Morris Pratt (KMP) is an algorithm, which checks the characters from left to right. When a pattern has a sub-pattern appears more than one in the sub-pattern, it uses that property to improve the time complexity, also for in the worst case. The time complexity of KMP is O(n).

What is the computing time of KMP algorithm?

The time complexity of KMP algorithm is O(n) in the worst case. KMP (Knuth Morris Pratt) Pattern Searching The Naive pattern searching algorithm doesn't work well in cases where we see many matching characters followed by a mismatching character. Following are some examples.

Which of the following algorithms compares a pattern with itself to find the longest prefix that is also the suffix of the string *?

In the KMP string matching algorithm, we calculate the LPS array such that LPS[i] denotes the longest proper prefix that is also a suffix of the substring ending at index 'i'. We can use the same approach to calculate the LPS array. Our final answer will be LPS[n-1].


1 Answers

Following a case for which it does not work:

i     0  1  2  3  4  5
p     A  B  A  B  B  A 
c1    0  0  1  2  0  1
c2    0  0  1  2  2  3

The reason being:

At i=4, len=2 
p[i]='B' and p[len]='A' //Mismatch!
lps string upto i=3: AB(0-1 prefix), (2-3 suffix)
-------------------------------
i=4
Next charecter: B
len=2 // longest prefix suffix length 
Charecter looking for : A (=p[len])

So upto i=3 we had AB(0-1) as the prefix that matched with suffix AB(2-3), but now at i=4 there is a mismatch so we see that we can't extend the original prefix(0-1) so the position to be checked is the prefix found prior to "AB" which is done by lps[len-1] < -1 as the array starts from 0 > and this is not necessarily len-1 as we may need to step back further than that to get the new longest prefix suffix.

like image 67
Sahil Sareen Avatar answered Oct 13 '22 00:10

Sahil Sareen