What is the theoretical basis of KMP pattern-matching algorithm?
I understand the algorithm itself, but, don't understand how did Knuth, Morris and Pratt invent this algorithm.
Was there any mathematical proof?
Can you give a link please?
Knuth Morris Pratt (KMP) is an algorithm, which checks the characters from left to right. When a pattern has a sub-pattern appears more than one in the sub-pattern, it uses that property to improve the time complexity, also for in the worst case. The time complexity of KMP is O(n).
The basic idea behind KMP's algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will anyway match.
The failure function actually tells us this: if you matched X characters of a string, what is the longest suffix of such string such that it's also a prefix of a search string. You are asking how it's built, the approach is quite straightforward.
In real world KMP algorithm is used in those applications where pattern matching is done in long strings, whose symbols are taken from an alphabet with little cardinality. A relevant example is the DNA alphabet, which consists on only 4 symbols (A,C,G,T).
The KMP matching algorithm is based on finite automata and works by implicitly building the transition table for an automaton that matches the string. Using a very clever linear-time preprocessing of the string to search for, a matching automaton can be constructed, and the automaton can then be simulated on the string to search in in linear time. The net result is a linear-time algorithm for string matching.
The automaton that's constructed works by having one state for each amount of the string that has been matched so far. By default, the transitions are such that matching the next character advances to the next state in the machine and reading an invalid character transitions back to the beginning. However, certain pieces of the string to search for might share some overlapping structure, so some new transitions are added that take the automaton back to an earlier (but not the first) state when a character is read.
This algorithm is generalized by the Aho-Corasick algorithm, which builds a more complex automaton in order to search for multiple strings at once.
I have an implementation of this algorithm on my personal site that contains an extensive discussion of the actual details of how the algorithm works, including a correctness proof, more formal intuition behind the algorithm, and explanation of how to derive the algorithm from first principles. It took me a while to put together, but I hope that it might be a good resource to learn more about the algorithm.
Hope this helps!
Morris discovered the algorithm from first principles, but Knuth independently learned about a theorem due to Stephen A. Cook that deterministic two-way pushdown automata could be simulated in linear-time and extracted an early version of "KMP" by specializing the simulation to a string matching automaton. Pratt contributed an efficiency improvement. See Knuth's retelling.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With