I am implementing Rabin-Karp algorithm for fun. I came across this pseudocode:
RABIN -KARP -MATCHER (T, P, d, q)
1 n = T.length
2 m = P.length
3 h = d^(m-1) mod q
4 p=0
5 t= 0
6 for i = 1 to m
/ preprocessing
/
7 p = (dp + P [i]) mod q
8 t = (dt + T [i]) mod q
9 for s = 0 to n-m
/ matching
/
10 if p == t
11 if P [1... m] == T [s + 1...s + m]
12 print “Pattern occurs with shift” s
13 if s < n-m
14 t = (d(t-T[s + 1]h) + T [s + m + 1]) mod q
I implemented in Python 2.7 like so:
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t =0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m):
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here
return result
where result is a list containing the index in text of pattern.
My code is failing to find the 26 in 3141592653589793 and I suspect it has something to do with my hashcode defined by line 14 of the pseudocode. Can anyone please help with this.
I passed in the following paramters:
P = "26" T = "3141592653589793" d = 257 q = 11
P and T must be strings/arrays of chars
Rabin-Karp algorithm is an algorithm used for searching/matching patterns in the text using a hash function. Unlike Naive string matching algorithm, it does not travel through every character in the initial phase rather it filters the characters that do not match and then performs the comparison.
Rabin-Karp cannot use just any hash function, it requires a specialised hash function that can be quickly calculated for a position i from the already known value for position (i-1). Yes, @Gigi, I have.
Like the Naive Algorithm, Rabin-Karp algorithm also slides the pattern one by one. But unlike the Naive algorithm, Rabin Karp algorithm matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.
But Rabin-Carp's algorithm implies using of rolling hash function. Rolling hash function, is the function which have special properties: if we already know value of H (c [0..n]), for example, we can compute H (c [1..n+1]) quickly. This is property of rolling hash function, which bernstein hash doesn't have! I think, we should downvote this answer!
Here is a working version of your code:
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t = 0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m+1): # note the +1
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (t-h*ord(text[s]))%q # remove letter s
t = (t*d+ord(text[s+m]))%q # add letter s+m
t = (t+q)%q # make sure that t >= 0
return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))
The output is:
[6]
[0, 1, 2, 3]
On the first step, you check whether text[0..m] == pattern
. On the second step, you want to check whether text[1..m+1] == pattern
. Thus you remove text[0]
from the hash (at the moment it is multiplied by your precomputed h
): t = (t-h*ord(text[s]))%q
. And then, add text[m]
to it: t = (t*d+ord(text[s+m]))%q
. On the next step, you remove text[1]
and add text[m+1]
, and so on. The t = (t+q)%q
line is there because a negative number modulo q
yields remainder in the range (-q; 0]
, and we want it to be in the range [0; q)
.
Note that you want to check a total of n-m+1
substrings, not n-m
, hence the correct loop is for s in range(n-m+1)
. Checked by the second example (finding "xx" in "xxxxx").
Also worth noting:
The line h = pow(d,m-1)%q
may be too slow if m
is large. It is better to take the result modulo q
after each of the m-2
multiplications.
This algorithm is still O(nm) in the worst case. With text="a"*100000
and pattern="a"*50000
, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime number q
may help reduce the probability of a false positive to an acceptable level.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With