 

Rabin–Karp algorithm for plagiarism implementation by using rolling hash

I am using the Rabin–Karp algorithm to check any two source code files for plagiarism. As a first step I simply implemented the algorithm in C#; the code is below. Its average- and best-case running time is O(n+m) with O(p) space, but its worst-case time is O(nm).

    public void plagiarism(string[] file1, string[] file2)
    {
        int matches = 0;   // number of offsets where file2 matches inside file1

        // Naive approach: slide file2 over file1 one position at a time
        // and compare element by element at every offset.
        for (int i = 0; i < file1.Length - file2.Length + 1; i++)
        {
            bool match = true;
            for (int j = 0; j < file2.Length; j++)
            {
                if (file1[i + j] != file2[j])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                matches++;
                Console.WriteLine(matches);
            }
        }

        if (matches == 0)
            Console.WriteLine("not copied");
    }

So how would I make it more efficient by using a rolling hash function, since that would be better than this?

asked Dec 08 '11 by Rdx


People also ask

What is the use of Rabin Karp algorithm?

The Rabin–Karp algorithm is an optimization of the naive algorithm that performs the search by rolling a window over the string and looking for the pattern. It calculates the hash value of the pattern (by creating your own hash function or equation that determines a hash value from the individual characters) and compares it with the hash of each window of the text.
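For example, a minimal sketch of such a pattern hash in C# might look like this (the base and modulus below are arbitrary choices, not values the algorithm requires):

    // Minimal sketch: hash a string by treating each character as a digit in base B.
    static long HashOf(string s)
    {
        const long B = 256;          // base: one "digit" per character (arbitrary choice)
        const long M = 1000000007;   // large prime modulus to keep the hash bounded (arbitrary choice)
        long h = 0;
        foreach (char c in s)
            h = (h * B + c) % M;     // h = c0*B^(n-1) + c1*B^(n-2) + ... + c(n-1) (mod M)
        return h;
    }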

What is the worst case complexity of Rabin Karp?

Rabin–Karp algorithm complexity: the average-case and best-case complexity of the Rabin–Karp algorithm is O(m + n), and the worst-case complexity is O(mn). The worst case occurs when spurious hits occur for all of the windows.

What is the difference between naive and Rabin-Karp algorithm?

Like the naive algorithm, the Rabin–Karp algorithm slides the pattern over the text one position at a time. But unlike the naive algorithm, Rabin–Karp first compares the hash value of the pattern with the hash value of the current substring of the text, and only if the hash values match does it start comparing individual characters.

Which algorithm determines hash value based on K-Gram?

The Rabin–Karp algorithm determines hash values based on word k-grams. K-grams are length-k subsequences of a string, where k can be 1, 2, 3, 4, etc. They are mostly used for spelling correction, but in this case we use k-grams to detect plagiarism patterns between two documents.
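As a sketch of the k-gram idea (assuming the document has already been flattened to a single string, and with k left as a parameter):

    using System.Collections.Generic;

    // Rough sketch of k-gram extraction: every length-k window of the text.
    static IEnumerable<string> KGrams(string text, int k)
    {
        for (int i = 0; i + k <= text.Length; i++)
            yield return text.Substring(i, k);
    }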


1 Answer

The Wikipedia article has a reasonably good discussion of the algorithm, and even mentions how you can implement the rolling hash function (see "Use of hashing for shifting substring search"). It also addresses how to improve runtime speed using a hash table or Bloom filter.
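As a rough sketch of the rolling update that section describes (B, M, and pow = B^(m-1) mod M are assumed to be set up to match however you hash the first window):

    // Sketch of the rolling update: when the window slides one position, remove the
    // outgoing character's contribution, shift, and append the incoming character.
    static long Roll(long hash, char outgoing, char incoming, long pow, long B, long M)
    {
        hash = (hash - outgoing * pow % M + M) % M;   // drop the leading character
        hash = (hash * B + incoming) % M;             // shift and add the trailing character
        return hash;
    }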

You also have to understand that the worst case is a fairly contrived example. The example given in the Wikipedia article is 'searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s.'

You should be able to implement the rolling hash using the techniques described in that Wikipedia entry. If you're having trouble implementing that, leave a more specific question about how it's done, showing what you've tried.
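For reference, here is a sketch of what a complete Rabin–Karp search can look like in C#; the base and modulus are arbitrary choices, and you'd adapt the string inputs to however you represent your source files:

    using System;

    class RabinKarpSearch
    {
        const long B = 256;          // base (arbitrary choice)
        const long M = 1000000007;   // large prime modulus (arbitrary choice)

        // Returns the index of the first occurrence of pattern in text, or -1.
        public static int Search(string text, string pattern)
        {
            int n = text.Length, m = pattern.Length;
            if (m == 0) return 0;
            if (m > n) return -1;

            long pow = 1;                              // B^(m-1) mod M
            for (int i = 1; i < m; i++) pow = pow * B % M;

            long hp = 0, ht = 0;                       // pattern hash, current window hash
            for (int i = 0; i < m; i++)
            {
                hp = (hp * B + pattern[i]) % M;
                ht = (ht * B + text[i]) % M;
            }

            for (int i = 0; i + m <= n; i++)
            {
                // Only compare character by character when the hashes collide.
                if (hp == ht && string.CompareOrdinal(text, i, pattern, 0, m) == 0)
                    return i;

                if (i + m < n)                         // roll the window one step right
                {
                    ht = (ht - text[i] * pow % M + M) % M;
                    ht = (ht * B + text[i + m]) % M;
                }
            }
            return -1;
        }
    }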

It's unlikely that you'll encounter anything approaching the worst case in real-world documents. Even if you were to encounter the worst case, the rolling hash will not reduce the complexity. Implementing the rolling hash gives a linear improvement in runtime, which will be swamped by the n*m complexity. If you find that the worst case happens often, then you probably need a different algorithm.

The other thing to note is that, whereas O(m*n) can be a problem, you have to look at the scale. How large are the documents you're examining? You say you're working with source code files. If you're looking at typical class projects, then you're probably talking maybe 2,000 lines of code. Those documents aren't going to exhibit the worst case. Even if they did, n*m isn't going to be a very large number.

However, if you have 100 documents and you want to know if any one is a substantial duplicate of the other, your larger problem is O(n^2) because you have to check every document against all the others. The number of document comparisons is equal to (n*(n-1))/2. If you're looking to optimize your process, you need a different algorithm. Ideally, something that will give you a "fingerprint" of a document. That way, you can compute the fingerprint for each document one time, and then compare the fingerprints for similarity.
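To make that concrete, here is a minimal sketch of the all-pairs driver; the similarity delegate is a placeholder for whatever per-pair measure you end up using, not an existing API:

    using System;

    // n documents mean n*(n-1)/2 calls to the per-pair similarity measure.
    static void CompareAll(string[] docs, Func<string, string, double> similarity)
    {
        for (int i = 0; i < docs.Length; i++)
            for (int j = i + 1; j < docs.Length; j++)
                Console.WriteLine($"doc {i} vs doc {j}: {similarity(docs[i], docs[j]):P0}");
    }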

Document fingerprinting is a well known problem. However, constructing a fingerprint that's useful for comparison purposes is a bit less straightforward. You'd want to look into a technique called shingling. I also saw some research about using a small Bloom filter (256 bytes or so) to represent a document, and the ability to do fast comparisons using that.
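Here is a rough sketch of the shingling idea; the shingle length and the use of GetHashCode as the shingle hash are illustrative assumptions only:

    using System.Collections.Generic;

    // Fingerprint each document once as a set of k-gram hashes.
    static HashSet<int> Fingerprint(string text, int k = 5)
    {
        var shingles = new HashSet<int>();
        for (int i = 0; i + k <= text.Length; i++)
            shingles.Add(text.Substring(i, k).GetHashCode());   // one hash per k-gram
        return shingles;
    }

    // Compare two fingerprints with Jaccard similarity: |A ∩ B| / |A ∪ B|.
    static double Jaccard(HashSet<int> a, HashSet<int> b)
    {
        var both = new HashSet<int>(a);
        both.IntersectWith(b);                                  // |A ∩ B|
        int union = a.Count + b.Count - both.Count;             // |A ∪ B|
        return union == 0 ? 1.0 : (double)both.Count / union;
    }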

All that said, I suspect that if you're talking a hundred or two source code files that are each maybe 1,000 or 2,000 lines long, the naive O(n^2) comparison technique using a good Rabin-Karp implementation will do what you want. It will take some time (you're going to do roughly 5,000 separate document comparisons), but I don't think the speed of the R-K implementation will be your limiting factor.

answered Sep 22 '22 by Jim Mischel