Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the theory behind KMP pattern matching algorithm? [closed]

What is the theoretical basis of KMP pattern-matching algorithm?

I understand the algorithm itself, but, don't understand how did Knuth, Morris and Pratt invent this algorithm.

Was there any mathematical proof?

Can you give a link please?

like image 369
user366312 Avatar asked Dec 10 '11 05:12

user366312


People also ask

What is KMP algorithm for pattern matching?

Knuth Morris Pratt (KMP) is an algorithm, which checks the characters from left to right. When a pattern has a sub-pattern appears more than one in the sub-pattern, it uses that property to improve the time complexity, also for in the worst case. The time complexity of KMP is O(n).

What is the basic principle in KMP algorithm?

The basic idea behind KMP's algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will anyway match.

How does KMP failure function work?

The failure function actually tells us this: if you matched X characters of a string, what is the longest suffix of such string such that it's also a prefix of a search string. You are asking how it's built, the approach is quite straightforward.

Where is KMP algorithm used?

In real world KMP algorithm is used in those applications where pattern matching is done in long strings, whose symbols are taken from an alphabet with little cardinality. A relevant example is the DNA alphabet, which consists on only 4 symbols (A,C,G,T).


2 Answers

The KMP matching algorithm is based on finite automata and works by implicitly building the transition table for an automaton that matches the string. Using a very clever linear-time preprocessing of the string to search for, a matching automaton can be constructed, and the automaton can then be simulated on the string to search in in linear time. The net result is a linear-time algorithm for string matching.

The automaton that's constructed works by having one state for each amount of the string that has been matched so far. By default, the transitions are such that matching the next character advances to the next state in the machine and reading an invalid character transitions back to the beginning. However, certain pieces of the string to search for might share some overlapping structure, so some new transitions are added that take the automaton back to an earlier (but not the first) state when a character is read.

This algorithm is generalized by the Aho-Corasick algorithm, which builds a more complex automaton in order to search for multiple strings at once.

I have an implementation of this algorithm on my personal site that contains an extensive discussion of the actual details of how the algorithm works, including a correctness proof, more formal intuition behind the algorithm, and explanation of how to derive the algorithm from first principles. It took me a while to put together, but I hope that it might be a good resource to learn more about the algorithm.

Hope this helps!

like image 150
templatetypedef Avatar answered Sep 28 '22 15:09

templatetypedef


Morris discovered the algorithm from first principles, but Knuth independently learned about a theorem due to Stephen A. Cook that deterministic two-way pushdown automata could be simulated in linear-time and extracted an early version of "KMP" by specializing the simulation to a string matching automaton. Pratt contributed an efficiency improvement. See Knuth's retelling.

like image 41
Per Avatar answered Sep 28 '22 14:09

Per