 

Rabin–Karp algorithm for plagiarism implementation by using rolling hash

I am using the Rabin–Karp algorithm to check any two source code files for plagiarism. As a first step I simply implemented the algorithm in C#; the code is below. Its average- and best-case running time is O(n+m) with O(p) space, but its worst-case time is O(nm).

    public void plagiarism(string[] file1, string[] file2)
    {
        int matches = 0;   // number of offsets where file2 matches inside file1

        // Naive approach: slide file2 over file1 one position at a time
        // and compare element by element at every offset.
        for (int i = 0; i < file1.Length - file2.Length + 1; i++)
        {
            bool match = true;
            for (int j = 0; j < file2.Length; j++)
            {
                if (file1[i + j] != file2[j])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                matches++;
                Console.WriteLine(matches);
            }
        }

        if (matches == 0)
            Console.WriteLine("not copied");
    }

So how would I make it more efficient by using a rolling hash function, since that would be better than this?

asked Dec 08 '11 by Rdx


People also ask

What is the use of Rabin Karp algorithm?

The Rabin–Karp algorithm is an optimization of the naive algorithm that performs the search by rolling a window over the string and looking for the pattern. It calculates the hash value of the pattern (by creating your own hash function or equation that determines a hash value from the individual characters) and compares it with the hash of each window of the text.
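For example, a minimal sketch of such a pattern hash in C# might look like this (the base and modulus below are arbitrary choices, not values the algorithm requires):

    // Minimal sketch: hash a string by treating each character as a digit in base B.
    static long HashOf(string s)
    {
        const long B = 256;          // base: one "digit" per character (arbitrary choice)
        const long M = 1000000007;   // large prime modulus to keep the hash bounded (arbitrary choice)
        long h = 0;
        foreach (char c in s)
            h = (h * B + c) % M;     // h = c0*B^(n-1) + c1*B^(n-2) + ... + c(n-1) (mod M)
        return h;
    }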

What is the worst case complexity of Rabin Karp?

Rabin–Karp algorithm complexity: the average-case and best-case complexity of the Rabin–Karp algorithm is O(m + n), and the worst-case complexity is O(mn). The worst case occurs when spurious hits occur for all of the windows.

What is the difference between naive and Rabin-Karp algorithm?

Like the naive algorithm, the Rabin–Karp algorithm slides the pattern over the text one position at a time. But unlike the naive algorithm, Rabin–Karp first compares the hash value of the pattern with the hash value of the current substring of the text, and only if the hash values match does it start comparing individual characters.

Which algorithm determines hash value based on K-Gram?

The Rabin–Karp algorithm determines hash values based on word k-grams. K-grams are length-k subsequences of a string, where k can be 1, 2, 3, 4, etc. They are mostly used for spelling correction, but in this case we use k-grams to detect plagiarism patterns between two documents.
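As a sketch of the k-gram idea (assuming the document has already been flattened to a single string, and with k left as a parameter):

    using System.Collections.Generic;

    // Rough sketch of k-gram extraction: every length-k window of the text.
    static IEnumerable<string> KGrams(string text, int k)
    {
        for (int i = 0; i + k <= text.Length; i++)
            yield return text.Substring(i, k);
    }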


1 Answer

The Wikipedia article has a reasonably good discussion of the algorithm, and even mentions how you can implement the rolling hash function (see "Use of hashing for shifting substring search"). It also addresses how to improve runtime speed using a hash table or Bloom filter.
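As a rough sketch of the rolling update that section describes (B, M, and pow = B^(m-1) mod M are assumed to be set up to match however you hash the first window):

    // Sketch of the rolling update: when the window slides one position, remove the
    // outgoing character's contribution, shift, and append the incoming character.
    static long Roll(long hash, char outgoing, char incoming, long pow, long B, long M)
    {
        hash = (hash - outgoing * pow % M + M) % M;   // drop the leading character
        hash = (hash * B + incoming) % M;             // shift and add the trailing character
        return hash;
    }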

You also have to understand that the worst case is a fairly contrived example. The example given in the Wikipedia article is 'searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s.'

You should be able to implement the rolling hash using the techniques described in that Wikipedia entry. If you're having trouble implementing that, leave a more specific question about how it's done, showing what you've tried.
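For reference, here is a sketch of what a complete Rabin–Karp search can look like in C#; the base and modulus are arbitrary choices, and you'd adapt the string inputs to however you represent your source files:

    using System;

    class RabinKarpSearch
    {
        const long B = 256;          // base (arbitrary choice)
        const long M = 1000000007;   // large prime modulus (arbitrary choice)

        // Returns the index of the first occurrence of pattern in text, or -1.
        public static int Search(string text, string pattern)
        {
            int n = text.Length, m = pattern.Length;
            if (m == 0) return 0;
            if (m > n) return -1;

            long pow = 1;                              // B^(m-1) mod M
            for (int i = 1; i < m; i++) pow = pow * B % M;

            long hp = 0, ht = 0;                       // pattern hash, current window hash
            for (int i = 0; i < m; i++)
            {
                hp = (hp * B + pattern[i]) % M;
                ht = (ht * B + text[i]) % M;
            }

            for (int i = 0; i + m <= n; i++)
            {
                // Only compare character by character when the hashes collide.
                if (hp == ht && string.CompareOrdinal(text, i, pattern, 0, m) == 0)
                    return i;

                if (i + m < n)                         // roll the window one step right
                {
                    ht = (ht - text[i] * pow % M + M) % M;
                    ht = (ht * B + text[i + m]) % M;
                }
            }
            return -1;
        }
    }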

It's unlikely that you'll encounter anything approaching the worst case in real-world documents. Even if you were to encounter the worst case, the rolling hash will not reduce the complexity. Implementing the rolling hash gives a linear improvement in runtime, which will be swamped by the n*m complexity. If you find that the worst case happens often, then you probably need a different algorithm.

The other thing to note is that, whereas O(m*n) can be a problem, you have to look at the scale. How large are the documents you're examining? You say you're working with source code files. If you're looking at typical class projects, then you're probably talking maybe 2,000 lines of code. Those documents aren't going to exhibit the worst case. Even if they did, n*m isn't going to be a very large number.

However, if you have 100 documents and you want to know if any one is a substantial duplicate of the other, your larger problem is O(n^2) because you have to check every document against all the others. The number of document comparisons is equal to (n*(n-1))/2. If you're looking to optimize your process, you need a different algorithm. Ideally, something that will give you a "fingerprint" of a document. That way, you can compute the fingerprint for each document one time, and then compare the fingerprints for similarity.
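To make that concrete, here is a minimal sketch of the all-pairs driver; the similarity delegate is a placeholder for whatever per-pair measure you end up using, not an existing API:

    using System;

    // n documents mean n*(n-1)/2 calls to the per-pair similarity measure.
    static void CompareAll(string[] docs, Func<string, string, double> similarity)
    {
        for (int i = 0; i < docs.Length; i++)
            for (int j = i + 1; j < docs.Length; j++)
                Console.WriteLine($"doc {i} vs doc {j}: {similarity(docs[i], docs[j]):P0}");
    }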

Document fingerprinting is a well known problem. However, constructing a fingerprint that's useful for comparison purposes is a bit less straightforward. You'd want to look into a technique called shingling. I also saw some research about using a small Bloom filter (256 bytes or so) to represent a document, and the ability to do fast comparisons using that.
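Here is a rough sketch of the shingling idea; the shingle length and the use of GetHashCode as the shingle hash are illustrative assumptions only:

    using System.Collections.Generic;

    // Fingerprint each document once as a set of k-gram hashes.
    static HashSet<int> Fingerprint(string text, int k = 5)
    {
        var shingles = new HashSet<int>();
        for (int i = 0; i + k <= text.Length; i++)
            shingles.Add(text.Substring(i, k).GetHashCode());   // one hash per k-gram
        return shingles;
    }

    // Compare two fingerprints with Jaccard similarity: |A ∩ B| / |A ∪ B|.
    static double Jaccard(HashSet<int> a, HashSet<int> b)
    {
        var both = new HashSet<int>(a);
        both.IntersectWith(b);                                  // |A ∩ B|
        int union = a.Count + b.Count - both.Count;             // |A ∪ B|
        return union == 0 ? 1.0 : (double)both.Count / union;
    }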

All that said, I suspect that if you're talking a hundred or two source code files that are each maybe 1,000 or 2,000 lines long, the naive O(n^2) comparison technique using a good Rabin-Karp implementation will do what you want. It will take some time (you're going to do roughly 5,000 separate document comparisons), but I don't think the speed of the R-K implementation will be your limiting factor.

answered Sep 22 '22 by Jim Mischel