Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's wrong with my implementation of the KMP algorithm?

static void Main(string[] args)
{
    string str = "ABC ABCDAB ABCDABCDABDE";//We should add some text here for 
                                           //the performance tests.

    string pattern = "ABCDABD";


    List<int> shifts = new List<int>();

    Stopwatch stopWatch = new Stopwatch();

    stopWatch.Start();
    NaiveStringMatcher(shifts, str, pattern);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Naive string matcher {0}", stopWatch.Elapsed));

    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }

    shifts.Clear();
    stopWatch.Restart();

    int[] pi = new int[pattern.Length];
    Knuth_Morris_Pratt(shifts, str, pattern, pi);
    stopWatch.Stop();
    Trace.WriteLine(String.Format("Knuth_Morris_Pratt {0}", stopWatch.Elapsed));

    foreach (int s in shifts)
    {
        Trace.WriteLine(s);
    }

    Console.ReadKey();
}

static IList<int> NaiveStringMatcher(List<int> shifts, string text, string pattern)
{
    int lengthText = text.Length;
    int lengthPattern = pattern.Length;

    for (int s = 0; s < lengthText - lengthPattern + 1; s++ )
    {
        if (text[s] == pattern[0])
        {
            int i = 0;
            while (i < lengthPattern)
            {
                if (text[s + i] == pattern[i])
                    i++;
                else break;
            }
            if (i == lengthPattern)
            {
                shifts.Add(s);                        
            }
        }
    }

    return shifts;
}

static IList<int> Knuth_Morris_Pratt(List<int> shifts, string text, string pattern, int[] pi)
{

    int patternLength = pattern.Length;
    int textLength = text.Length;            
    //ComputePrefixFunction(pattern, pi);

    int j;

    for (int i = 1; i < pi.Length; i++)
    {
        j = 0;
        while ((i < pi.Length) && (pattern[i] == pattern[j]))
        {
            j++;
            pi[i++] = j;
        }
    }

    int matchedSymNum = 0;

    for (int i = 0; i < textLength; i++)
    {
        while (matchedSymNum > 0 && pattern[matchedSymNum] != text[i])
            matchedSymNum = pi[matchedSymNum - 1];

        if (pattern[matchedSymNum] == text[i])
            matchedSymNum++;

        if (matchedSymNum == patternLength)
        {
            shifts.Add(i - patternLength + 1);
            matchedSymNum = pi[matchedSymNum - 1];
        }

    }

    return shifts;
}

Why does my implemention of the KMP algorithm work slower than the Naive String Matching algorithm?

like image 562
Grigory Bushuev Avatar asked May 10 '11 19:05

Grigory Bushuev


2 Answers

The KMP algorithm has two phases: first it builds a table, and then it does a search, directed by the contents of the table.

The naive algorithm has one phase: it does a search. It does that search much less efficiently in the worst case than the KMP search phase.

If the KMP is slower than the naive algorithm then that is probably because building the table is taking you longer than it takes to simply search the string naively in the first place. Naive string matching is usually very fast on short strings. There is a reason why we don't use fancy-pants algorithms like KMP inside the BCL implementations of string searching. By the time you set up the table, you could have done half a dozen searches of short strings with the naive algorithm.

KMP is only a win if you have enormous strings and you are doing lots of searches that allow you to re-use an already-built table. You need to amortize away the huge cost of building the table by doing lots of searches using that table.

And also, the naive algorithm only has bad performance in bizarre and unlikely scenarios. Most people are searching for words like "London" in strings like "Buckingham Palace, London, England", and not searching for strings like "BANANANANANANA" in strings like "BANAN BANBAN BANBANANA BANAN BANAN BANANAN BANANANANANANANANAN...". The naive search algorithm is optimal for the first problem and highly sub-optimal for the latter problem; but it makes sense to optimize for the former, not the latter.

Another way to put it: if the searched-for string is of length w and the searched-in string is of length n, then KMP is O(n) + O(w). The Naive algorithm is worst case O(nw), best case O(n + w). But that says nothing about the "constant factor"! The constant factor of the KMP algorithm is much larger than the constant factor of the naive algorithm. The value of n has to be awfully big, and the number of sub-optimal partial matches has to be awfully large, for the KMP algorithm to win over the blazingly fast naive algorithm.

That deals with the algorithmic complexity issues. Your methodology is also not very good, and that might explain your results. Remember, the first time you run code, the jitter has to jit the IL into assembly code. That can take longer than running the method in some cases. You really should be running the code a few hundred thousand times in a loop, discarding the first result, and taking an average of the timings of the rest.

If you really want to know what is going on you should be using a profiler to determine what the hot spot is. Again, make sure you are measuring the post-jit run, not the run where the code is jitted, if you want to have results that are not skewed by the jit time.

like image 171
Eric Lippert Avatar answered Oct 23 '22 13:10

Eric Lippert


Your example is too small and it does not have enough repetitions of the pattern where KMP avoids backtracking.

KMP can be slower than the normal search in some cases.

like image 37
Javi Avatar answered Oct 23 '22 14:10

Javi