Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rabin Karp string matching algorithm

I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S? What significance do these values carry with them?

static void Main(string[] args)
{
    string A = "String that contains a pattern.";
    string B = "pattern";
    ulong siga = 0;
    ulong sigb = 0;
    ulong Q = 100007;
    ulong D = 256;
    for (int i = 0; i < B.Length; i++)
    {
        siga = (siga * D + (ulong)A[i]) % Q;
        sigb = (sigb * D + (ulong)B[i]) % Q;
    }
    if (siga == sigb)
    {
        Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
        return;
    }
    ulong pow = 1;
    for (int k = 1; k <= B.Length - 1; k++)
        pow = (pow * D) % Q;

    for (int j = 1; j <= A.Length - B.Length; j++)
    {
        siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
        siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
        if (siga == sigb)
        {
            if (A.Substring(j, B.Length) == B)
            {
                Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                    A.Substring(j, B.Length),
                                                                    A.Substring(j + B.Length)));
                return;
            }
        }
    }
    Console.WriteLine("Not copied!");
}
like image 750
c grum Avatar asked Apr 26 '12 17:04

c grum


People also ask

What is the significance of hashing in Rabin Karp pattern matching algorithm?

Karp and Michael O. Rabin, the Rabin-Karp algorithm or Karp-Rabin algorithm is a string-searching algorithm that utilises hashing to find matches between a given search pattern and a text. A naive string-searching algorithm would compare the given pattern against all positions in the text.

Which is better Rabin Karp or KMP algorithm?

KMP guarantees 100% reliability. You cannot guarantee 100% with Rabin Karp because of a chance of collision during hash table lookup. But with good hash generation algorithms that do exist today, it is possible that Rabin Karp can yield very close to 100% reliability in finding a match.


1 Answers

About the magic numbers Paul's answer is pretty clear.

As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.

The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2) instead of linear O(n).

Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.

So, let's comment your code:

for (int i = 0; i < B.Length; i++)
{
    siga = (siga * D + (ulong)A[i]) % Q;
    sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
    Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
    return;
}

^ This piece computes the hash of pattern B (sigb), and the hashcode of the initial substring of A of the same length of B. Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B).

ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
    pow = (pow * D) % Q;

^ Here's computed pow that is necessary to perform the rolling hash.

for (int j = 1; j <= A.Length - B.Length; j++)
{
    siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
    siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
    if (siga == sigb)
    {
        if (A.Substring(j, B.Length) == B)
        {
            Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                A.Substring(j, B.Length),
                                                                A.Substring(j + B.Length)));
            return;
        }
    }
}

^ Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).

If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.


¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.

like image 181
digEmAll Avatar answered Sep 24 '22 04:09

digEmAll