I'm learning about string searching algorithms and understand how they work but haven't found a good enough answer about in which cases Rabin-Karp algorithm would be more effective than KMP or Boyer-Moore. I see that it is easier to implement and doesn't need the same overhead but beyond that, I have no clue.
So, when is Rabin-Karp better to use than the others?
Also Rabin Karp is easier to implement than KMP it works based on a rolling hash whereas KMP works based on a failure function. How do I get kruskal algorithm?
The comparison algorithm is Knuth Morris Pratt (KMP) and Boyer Moore (BM) algorithm. Based on previous research, the KMP algorithm has a better performance compared to other string matching algorithms. However, other studies have concluded that the BM algorithm has better performance.
Advantage over Naive String Matching AlgorithmThis technique results in only one comparison per text sub-sequence and brute force is only required when the hash values match.
The Rabin–Karp algorithm is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, it is a useful algorithm for multiple pattern search.
The Rabin-Karp algorithm is better when searching for a large text that is finding multiple pattern matches, like detecting plagiarism.
And Boyer-Moore is better when the pattern is relatively large with a moderately sized alphabet and with a large vocabulary. And it does not work well with binary strings or very short patterns.
Meanwhile, KMP is good for searching inside a smaller alphabet, like in bioinformatics or searching in binary strings. And it does not run fast if the alphabet increases.
There are a couple of properties that each of these algorithms have that might make them desirable or undesirable in different circumstances. Here's a quick rundown:
One major advantage of Rabin-Karp is that it uses O(1) auxiliary storage space, which is great if the pattern string you're looking for is very large. For example, if you're looking for all occurrences of a string of length 107 in a longer string of length 109, not having to allocate a table of 107 machine words for a failure function or shift table is a major win. Both Boyer-Moore and KMP use Ω(n) memory on a pattern string of length n, so Rabin-Karp would be a clear win here.
Rabin-Karp suffers from two potential worst cases. First, if the particular prime numbers used by Rabin-Karp are known to a malicious adversary, that adversary could potentially craft an input that causes the rolling hash to match the hash of a pattern string at each point in time, causing the algorithm's performance to degrade to Ω((m - n + 1)n) on a string of length m and pattern of length n. If you're taking untrusted strings as input, this could potentially be an issue. Neither Boyer-Moore nor KMP have these weaknesses.
Similarly, Rabin-Karp is slow in the case where you want to find all matches of a pattern string in the case where that pattern appears a large number of times. For example, if you're looking for a string of 105 copies of the letter a
in text string consisting of 109copies of the letter a
with Rabin-Karp, then there will be lots of spots where the pattern string appears, and each will require a linear scan. This can also lead to a runtime of Ω((m + n - 1)n).
Many Boyer-Moore implementations suffer from this second rule, but will not have bad runtimes in the first case. And KMP has no pathological worst-cases like these.
One advantage of the Boyer-Moore algorithm is that it doesn't necessarily have to scan all the characters of the input string. Specifically, the Bad Character Rule can be used to skip over huge regions of the input string in the event of a mismatch. More specifically, the best-case runtime for Boyer-Moore is O(m / n), which is much faster than what Rabin-Karp or KMP can provide.
Suppose you have a fixed set of multiple text strings that you want to search for, rather than just one. You could, if you wanted to, run multiple passes of Rabin-Karp, KMP, or Boyer-Moore across the strings to find all the matches. However, the runtime of this approach isn't great, as it scales linearly with the number of strings to search for. On the other hand, KMP generalizes nicely to the Aho-Corasick string-matching algorithm, which runs in time O(m + n + z), where z is the number of matches found and n is the combined length of the pattern strings. Notice that there's no dependence here on the number of different pattern strings being searched for!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With