Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Rabin-Karp algorithm hashing

I am implementing Rabin-Karp algorithm for fun. I came across this pseudocode:

    RABIN -KARP -MATCHER (T, P, d, q)
    1 n = T.length
    2 m = P.length
    3 h = d^(m-1) mod q
    4 p=0
    5 t= 0
    6 for i = 1 to m
    / preprocessing
    /
    7 p = (dp + P [i]) mod q
    8 t = (dt + T [i]) mod q
    9 for s = 0 to n-m
    / matching
    /
    10     if p == t
    11         if P [1... m] == T [s + 1...s + m]
    12             print “Pattern occurs with shift” s
    13     if s < n-m
    14         t  = (d(t-T[s + 1]h) + T [s + m + 1]) mod q

I implemented in Python 2.7 like so:

def Rabin_Karp_Matcher(text, pattern, d, q):
    n = len(text)
    m = len(pattern)
    h = pow(d,m-1)%q
    p = 0
    t =0
    result = []
    for i in range(m): # preprocessing
        p = (d*p+ord(pattern[i]))%q
        t = (d*t+ord(text[i]))%q
    for s in range(n-m):
        if p == t: # check character by character
            match = True
            for i in range(m):
                if pattern[i] != text[s+i]:
                    match = False
                    break
            if match:
                result = result + [s]
        if s < n-m:
                t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here
    return result

where result is a list containing the index in text of pattern.

My code is failing to find the 26 in 3141592653589793 and I suspect it has something to do with my hashcode defined by line 14 of the pseudocode. Can anyone please help with this.

I passed in the following paramters:

P = "26" T = "3141592653589793" d = 257 q = 11

P and T must be strings/arrays of chars

like image 334
SeekingAlpha Avatar asked Mar 06 '14 06:03

SeekingAlpha


People also ask

What is the use of Rabin Karp algorithm?

Rabin-Karp algorithm is an algorithm used for searching/matching patterns in the text using a hash function. Unlike Naive string matching algorithm, it does not travel through every character in the initial phase rather it filters the characters that do not match and then performs the comparison.

Can I use any hash function in Rabin-Karp?

Rabin-Karp cannot use just any hash function, it requires a specialised hash function that can be quickly calculated for a position i from the already known value for position (i-1). Yes, @Gigi, I have.

What is the difference between naive and Rabin-Karp algorithm?

Like the Naive Algorithm, Rabin-Karp algorithm also slides the pattern one by one. But unlike the Naive algorithm, Rabin Karp algorithm matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.

What is the difference between Rabin-carp's algorithm and Bernstein's algorithm?

But Rabin-Carp's algorithm implies using of rolling hash function. Rolling hash function, is the function which have special properties: if we already know value of H (c [0..n]), for example, we can compute H (c [1..n+1]) quickly. This is property of rolling hash function, which bernstein hash doesn't have! I think, we should downvote this answer!


Video Answer


1 Answers

Here is a working version of your code:

def Rabin_Karp_Matcher(text, pattern, d, q):
    n = len(text)
    m = len(pattern)
    h = pow(d,m-1)%q
    p = 0
    t = 0
    result = []
    for i in range(m): # preprocessing
        p = (d*p+ord(pattern[i]))%q
        t = (d*t+ord(text[i]))%q
    for s in range(n-m+1): # note the +1
        if p == t: # check character by character
            match = True
            for i in range(m):
                if pattern[i] != text[s+i]:
                    match = False
                    break
            if match:
                result = result + [s]
        if s < n-m:
            t = (t-h*ord(text[s]))%q # remove letter s
            t = (t*d+ord(text[s+m]))%q # add letter s+m
            t = (t+q)%q # make sure that t >= 0
    return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))

The output is:

[6]
[0, 1, 2, 3]

On the first step, you check whether text[0..m] == pattern. On the second step, you want to check whether text[1..m+1] == pattern. Thus you remove text[0] from the hash (at the moment it is multiplied by your precomputed h): t = (t-h*ord(text[s]))%q. And then, add text[m] to it: t = (t*d+ord(text[s+m]))%q. On the next step, you remove text[1] and add text[m+1], and so on. The t = (t+q)%q line is there because a negative number modulo q yields remainder in the range (-q; 0], and we want it to be in the range [0; q).

Note that you want to check a total of n-m+1 substrings, not n-m, hence the correct loop is for s in range(n-m+1). Checked by the second example (finding "xx" in "xxxxx").

Also worth noting:

  1. The line h = pow(d,m-1)%q may be too slow if m is large. It is better to take the result modulo q after each of the m-2 multiplications.

  2. This algorithm is still O(nm) in the worst case. With text="a"*100000 and pattern="a"*50000, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime number q may help reduce the probability of a false positive to an acceptable level.

like image 181
Gassa Avatar answered Sep 18 '22 10:09

Gassa