I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S? What significance do these values carry with them?
static void Main(string[] args)
{
string A = "String that contains a pattern.";
string B = "pattern";
ulong siga = 0;
ulong sigb = 0;
ulong Q = 100007;
ulong D = 256;
for (int i = 0; i < B.Length; i++)
{
siga = (siga * D + (ulong)A[i]) % Q;
sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
return;
}
ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
pow = (pow * D) % Q;
for (int j = 1; j <= A.Length - B.Length; j++)
{
siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
if (siga == sigb)
{
if (A.Substring(j, B.Length) == B)
{
Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
A.Substring(j, B.Length),
A.Substring(j + B.Length)));
return;
}
}
}
Console.WriteLine("Not copied!");
}
Karp and Michael O. Rabin, the Rabin-Karp algorithm or Karp-Rabin algorithm is a string-searching algorithm that utilises hashing to find matches between a given search pattern and a text. A naive string-searching algorithm would compare the given pattern against all positions in the text.
KMP guarantees 100% reliability. You cannot guarantee 100% with Rabin Karp because of a chance of collision during hash table lookup. But with good hash generation algorithms that do exist today, it is possible that Rabin Karp can yield very close to 100% reliability in finding a match.
About the magic numbers Paul's answer is pretty clear.
As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.
The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2)
instead of linear O(n)
.
Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.
So, let's comment your code:
for (int i = 0; i < B.Length; i++)
{
siga = (siga * D + (ulong)A[i]) % Q;
sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
return;
}
^
This piece computes the hash of pattern B
(sigb
), and the hashcode of the initial substring of A
of the same length of B
.
Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B)
.
ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
pow = (pow * D) % Q;
^
Here's computed pow
that is necessary to perform the rolling hash.
for (int j = 1; j <= A.Length - B.Length; j++)
{
siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
if (siga == sigb)
{
if (A.Substring(j, B.Length) == B)
{
Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
A.Substring(j, B.Length),
A.Substring(j + B.Length)));
return;
}
}
}
^
Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).
If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.
¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With