Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need help in understanding Rolling Hash computation in constant time for Rabin-Karp Implementation

I've been trying to implement Rabin-Karp algorithm in Java. I have hard time computing the rolling hash value in constant time. I've found one implementation at http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html. Still I could not get how these two lines work.

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;  

I looked at couple of articles on modular arithmetic but no article could able to penetrate my thick skull. Please give some pointers to understand this.

like image 771
FourOfAKind Avatar asked May 24 '11 11:05

FourOfAKind


People also ask

Why does using a rolling hash make the Rabin-Karp algorithm more efficient?

Rabin-Karp improves upon this concept by utilising the fact that comparing the hashes of two strings can be done in linear time and is far more efficient than comparing the individual characters of those strings to find a match. Thus, providing a best case runtime complexity of O(n+m).

How is hash value calculated in Rabin-Karp algorithm?

Initially calculate the hash value of the pattern. Start iterating from the starting of the string: Calculate the hash value of the current substring having length m. If the hash value of the current substring and the pattern are same check if the substring is same as the pattern.

How do you calculate rolling hash?

Calculation of the hash of a string+ s [ n − 1 ] ⋅ p n − 1 mod m = ∑ i = 0 n − 1 s [ i ] ⋅ p i mod m , where and are some chosen, positive numbers. It is called a polynomial rolling hash function. It is reasonable to make a prime number roughly equal to the number of characters in the input alphabet.

Which string matching algorithm uses the rolling hash function?

It is possible with the help of rolling hash. Rabin Karp algorithm is one of the optimized algorithms of the naive algorithm that performs searching by rolling over the string and search the pattern.


2 Answers

First you need to understand how the hash is computed.

Lets take a simple case of base 10 strings. How would you guarantee that the hash code of a string is unique? Base 10 is what we use to represent numbers, and we don't have collisions!!

"523" = 5*10^2 + 2*10^1 + 3*10^0 = 523

using the above hash function you are guaranteed to get distinct hashes for every string.

Given the hash of "523", if you want to calculate the hash of "238", i.e. by jutting out the leftmost digit 5 and bringing in a new digit 8 from the right, you would have to do the following:

1) remove the effect of the 5 from the hash: hash = hash - 5*10^2 (523-500 = 23)

2) adjust the hash of the remaining chars by shifting by 1, hash = hash * 10

3) add the hash of the new character: hash = hash + 8 (230 + 8 = 238, which as we expected is the base 10 hash of "238")

Now let's extend this to all ascii characters. This takes us to the base 256 world. Therefore the hash of the same string "523" now is

= 5*256^2 + 2*256^1 + 3*256^0 = 327680 + 512 + 3 = 328195.

You can imagine as the string length increases you will will exceed the range of integer/long in most programming languages relatively quickly.

How can we solve this? The way this is routinely solved is by working with modulus a large prime number. The drawback of this method is that we will now get false positives as well, which is a small price to pay if it takes the runtime of your algorithm from quadratic to linear!

The complicated equation you quoted is nothing but the steps 1-3 above done with modulus math. The two modulus properties used above are ->

a) (a*b) % p = ((a % p) * (b % p)) % p

b) a % p = (a + p) % p

Lets go back to steps 1-3 mentioned above ->

1) (expanded using property a) hash = hash - ((5 % p)*(10^2 %p) %p)

vs. what you quoted

txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;

Here are is how the two are related!

  • RM = 10^3 % p
  • txt.charAt(i-M) % Q = 5 % p
  • The additional + Q you see is just to ensure that the hash is not negative. See property b above.

2 & 3) hash = hash*10 + 8, vs txtHash = (txtHash*R + txt.charAt(i)) % Q; Is the same but with taking mod of the final hash result!

Looking at properties a & b more closely, should help you figure it out!

like image 98
Aneesh Avatar answered Nov 15 '22 17:11

Aneesh


This is the "rolling" aspect of the hash. It's eliminating the contribution of the oldest character (txt.charAt(i-M)), and incorporating the contribution of the newest character(txt.charAt(i)).

The hash function is defined as:

            M-1
hash[i] = ( SUM { input[i-j] * R^j } ) % Q
            j=0

(where I'm using ^ to denote "to the power of".)

But this can be written as an efficient recursive implementation as:

hash[i] = (txtHash*R - input[i-M]*(R^M) + input[i]) % Q

Your reference code is doing this, but it's using various techniques to ensure that the result is always computed correctly (and efficiently).

So, for instance, the + Q in the first expression has no mathematical effect, but it ensures that the result of the sum is always positive (if it goes negative, % Q doesn't have the desired effect). It's also breaking the calculation into stages, presumably to prevent numerical overflow.

like image 26
Oliver Charlesworth Avatar answered Nov 15 '22 19:11

Oliver Charlesworth