I am building a web tool to check whether submitted content is taken from the web or is the submitter's own work: a plagiarism detector.
My idea is to generate a checksum and use it as a key to compare against other entries. However, if someone makes small changes, such as adding or removing comments or renaming variables and functions, the checksum will be different, so this approach won't work.
Any suggestions for a better way?
Copying the source code behind a page can infringe copyright, and developers must be careful not to simply recreate and then host copied code. Fonts can be reused in similar design layouts, but a design that is an exact copy of someone else's work, rather than original, counts as web design plagiarism as well.
It is also considered plagiarism if you take program code written by another person and present it as your own work. Almost all computer programs contain many ideas borrowed from elsewhere. Many also contain short sections of actual code copied from elsewhere.
Plagiarism detection is a special case of similarity detection. This is a big field of study that's almost as old as computer science itself. There is a lot of published research, and there is no single simple answer.
See, e.g., a Google Scholar search for "code similarity plagiarism" or "plagiarism detection". Regular Google searches for things like "source code similarity detection algorithm" can also be useful.
There are plenty of existing tools in the space, too, so I'm surprised you're trying to write your own.
As you've noted, a checksum won't do the job unless the code is perfectly identical. Techniques that can help include:
Building word-frequency histograms and comparing them
Extracting comment text and looking for copied comments using text-substring matching
Extracting variable, class and method names and looking for other code that uses the same names. You have to do a lot of correction for "obvious" names that everyone will choose, and for names that are dictated by the problem, like implementing a particular interface or API. Private class member variables and the local variables inside a function or method are the most useful to compare. You will need the help of a compiler or at least a syntax parser for the language to extract these.
Looking for differences in indenting style. Did the user use all-spaces indenting, except for this one function that's indented with tabs?
Comparing parse trees or token streams to strip out the effects of formatting. You'd usually have to compare individual functions, etc., not just the code as a whole.
... and lots more
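To make the token-stream idea concrete, here is a minimal sketch, assuming the submissions are Python source and using only the standard library's tokenize and difflib modules. The helper names normalized_tokens and token_similarity are made up for illustration. Comments and blank lines are dropped and every identifier is replaced by a placeholder, so renaming variables or editing comments no longer changes the result the way it changes a checksum.

```python
import io
import keyword
import tokenize
from difflib import SequenceMatcher

# Token types that carry commentary or blank lines only (assumption: Python input).
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Layout tokens that are syntactically meaningful in Python; keep them by name.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def normalized_tokens(source: str) -> list[str]:
    """Reduce source to a token list that ignores comments, names and literal values."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type in _LAYOUT:
            out.append(tokenize.tok_name[tok.type])
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")   # every identifier looks the same after renaming
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # keywords and operators are kept verbatim
    return out


def token_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized token streams."""
    matcher = SequenceMatcher(None, normalized_tokens(a), normalized_tokens(b), autojunk=False)
    return matcher.ratio()


original = "def add(a, b):\n    # sum two values\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y  # changed\n"
unrelated = "for i in range(10):\n    print(i)\n"

print(token_similarity(original, renamed))
print(token_similarity(original, unrelated))
```

In practice you would run this per function rather than per file, and treat a high ratio as one signal among several rather than as proof.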
What you'll have to do is produce a report that weighs all these factors and others and presents them to a human so the human can make a decision. Your tool should explain why it thinks two results are similar, not just that they are similar.
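As a rough illustration of such a report, the sketch below combines two of the signals mentioned above (word-frequency histograms and copied comments) and returns human-readable reasons instead of a bare score. Everything here is an assumption for illustration: the function names, the 0.8 threshold, the 10-character minimum for comments, and the naive '#' comment matching would all need tuning for a real tool.

```python
import math
import re
from collections import Counter


def word_histogram(source: str) -> Counter:
    """Count identifier-like words in the source text."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def cosine_similarity(h1: Counter, h2: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm = math.sqrt(sum(c * c for c in h1.values())) * math.sqrt(sum(c * c for c in h2.values()))
    return dot / norm if norm else 0.0


def comments(source: str) -> set[str]:
    """Collect '#' comment texts. Naive: a '#' inside a string literal also matches."""
    return {m.strip() for m in re.findall(r"#(.*)", source) if len(m.strip()) >= 10}


def similarity_report(a: str, b: str, word_threshold: float = 0.8) -> list[str]:
    """Return a list of reasons why two submissions look similar (empty if none)."""
    findings = []
    score = cosine_similarity(word_histogram(a), word_histogram(b))
    if score >= word_threshold:
        findings.append(
            f"word-frequency cosine similarity is {score:.2f} (threshold {word_threshold})"
        )
    for text in sorted(comments(a) & comments(b)):
        findings.append(f"identical comment in both files: {text!r}")
    return findings


first = "# compute the running total\ntotal = 0\nfor x in items:\n    total += x\n"
second = "total = 0\nfor x in items:\n    total += x  # compute the running total\n"

for reason in similarity_report(first, second):
    print(reason)
```

Each finding is a sentence a reviewer can check against the two files, which is the point: the tool supplies evidence and the human makes the call.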