I am interested in implementing the Rabin-Karp algorithm to search for substrings, as described on Wikipedia: http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm. Not for homework, but out of self-interest. I have implemented the Rabin-Karp algorithm (shown below) and it works. However, the performance is really, really bad! I understand that my hash function is basic, but it seems that a simple call to strstr() will always outperform my rabin_karp() function. I can understand why - the hash function does more work than a simple char-by-char compare on each loop iteration. What am I missing here? Should the Rabin-Karp algorithm be faster than a call to strstr()? When is the Rabin-Karp algorithm best used? Have I even implemented the algorithm correctly?
size_t hash(char* str, size_t i)
{
    size_t h = 0;
    size_t magic_exp = 1;
    // if (str != NULL)
    {
        while (i-- != 0)
        {
            magic_exp *= 101;
            h += magic_exp + *str;
            ++str;
        }
    }
    return h;
}

char* rabin_karp(char* s, char* find)
{
    char* p = NULL;
    if (s != NULL && find != NULL)
    {
        size_t n = strlen(s);
        size_t m = strlen(find);
        if (n > m)
        {
            size_t hfind = hash(find, m);
            char* end = s + (n - m + 1);
            for (char* i = s; i < end; ++i)
            {
                size_t hs = hash(i, m);
                if (hs == hfind)
                {
                    if (strncmp(i, find, m) == 0)
                    {
                        p = i;
                        break;
                    }
                }
            }
        }
    }
    return p;
}
You haven't implemented the hash correctly. The key to Rabin-Karp is to incrementally update the hash as the potential match moves along the string to be searched. As you've determined, if you recalculate the entire hash for each potential match position, things will be really slow.
For every comparison except the first, your hash function should take an existing hash, one new character, and one old character, and return an updated hash.
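A minimal sketch of what such an update function might look like, assuming a polynomial hash h = s[0]*B^(m-1) + ... + s[m-1] with base B = 101 and the implicit modulus 2^64 from unsigned wrap-around (the base, the function name, and the parameters are illustrative assumptions, not the poster's code):

    #define BASE 101u

    /* high_pow must be BASE^(m-1), precomputed once for pattern length m.
     * Removes the outgoing character's contribution, shifts the window,
     * and adds the incoming character - all in O(1). */
    size_t roll(size_t h, size_t high_pow, char old_c, char new_c)
    {
        return (h - (unsigned char)old_c * high_pow) * BASE + (unsigned char)new_c;
    }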
Rabin-Karp is a rolling-hash algorithm - the idea is to be able to move the window one position in either direction (left or right) and recompute the hash with a constant number of operations. As you have implemented it, the search has complexity O(N * L), where N is the length of the big string and L is the length of the string you are searching for. This is the complexity of the most naive approach, and arguably even a slight pessimization of it.
To improve your algorithm, precompute the powers of magic_exp and use them to 'roll' your hash - just as with polynomials, you subtract the highest-degree term, multiply by magic_exp, and add the value of the symbol to the right (to move the hash one position to the right), as sketched below.
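A rough sketch of what the rolled version could look like (my own illustration, not the poster's code; base 101 and wrap-around arithmetic in size_t are assumptions):

    #include <stddef.h>
    #include <string.h>

    #define BASE 101u

    static size_t poly_hash(const char* str, size_t len)
    {
        size_t h = 0;
        for (size_t i = 0; i < len; ++i)
            h = h * BASE + (unsigned char)str[i];   /* Horner's rule */
        return h;
    }

    char* rabin_karp_rolling(char* s, const char* find)
    {
        if (s == NULL || find == NULL)
            return NULL;

        size_t n = strlen(s);
        size_t m = strlen(find);
        if (m == 0 || m > n)
            return NULL;

        /* highest-degree coefficient: BASE^(m-1) */
        size_t high_pow = 1;
        for (size_t i = 1; i < m; ++i)
            high_pow *= BASE;

        size_t hfind = poly_hash(find, m);
        size_t hs    = poly_hash(s, m);       /* hash of s[0..m-1] */

        for (size_t i = 0; i + m <= n; ++i)
        {
            if (hs == hfind && strncmp(s + i, find, m) == 0)
                return s + i;

            if (i + m < n)                    /* roll the window one char right */
                hs = (hs - (unsigned char)s[i] * high_pow) * BASE
                     + (unsigned char)s[i + m];
        }
        return NULL;
    }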
Hope this helps.
strstr is typically implemented with a linear-time string-matching algorithm (glibc, for instance, uses the Two-Way algorithm; KMP is another linear option). This means that the asymptotic complexity of the two approaches is about the same, and from then on the constant factor is what matters. Especially when you have a bad hash function with a lot of collisions, a tuned strstr will be a lot faster.
EDIT: One more thing. It is very important for the Rabin-Karp algorithm to have the hash codes of all the prefixes precalculated, so that the hash of any window can be obtained in constant time. Right now you are not implementing proper Rabin-Karp, because each call to your hash function is linear, not constant, in complexity. (Which, by the way, means that Wikipedia is not a very good source to learn Rabin-Karp from.)
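One way to read that remark: precompute prefix hashes and powers of the base once, so the hash of any window comes out in O(1) after O(n) preprocessing. A sketch under the same assumptions as above (the names precompute and window_hash are mine, not from the answer):

    #include <stddef.h>

    #define BASE 101u

    /* Fills prefix[0..n] and pow[0..n]; both arrays must hold n+1 elements.
     * prefix[i] is the polynomial hash of s[0..i-1], pow[i] is BASE^i. */
    void precompute(const char* s, size_t n, size_t* prefix, size_t* pow)
    {
        prefix[0] = 0;
        pow[0] = 1;
        for (size_t i = 0; i < n; ++i)
        {
            prefix[i + 1] = prefix[i] * BASE + (unsigned char)s[i];
            pow[i + 1] = pow[i] * BASE;
        }
    }

    /* Hash of the window s[i..i+m-1] in O(1). */
    size_t window_hash(const size_t* prefix, const size_t* pow, size_t i, size_t m)
    {
        return prefix[i + m] - prefix[i] * pow[m];
    }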