
Similar code detector

I'm searching for a tool that can compare source code for similarity.

We have a very trivial system right now that produces a huge number of false positives, and the real positives easily get buried among them.

My requirements are:

  • reasonably small number of false positives
  • good detection rate (yes, these two pull against each other)
  • ideally with a more detailed output than just a single value
  • usable for C (C99) and C++ (C++03 and ideally C++11)
  • still maintained
  • usable for comparing two source files against each other
  • usable in non-interactive mode

EDIT:

To avoid confusion, the following two code snippets are equivalent and should be detected as such:

for (int i = 0; i < 10; i++) { bla; }

int i = 0; while (i < 10) { bla; i++; }

The same here:

int x = 10; y = x + 5;

int a = 10; y = a + 5;

Šimon Tóth asked Jun 06 '12 10:06




2 Answers

I've used MOSS in the past to detect plagiarized code: http://theory.stanford.edu/~aiken/moss/. Since it compares the structure of the code rather than its raw text, it should detect situations like the ones you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way toward detecting code that has been modified through simple search-and-replace of variable and/or function names.
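To give a feel for why renaming and comments stop mattering once a tool works on tokens, here is a minimal sketch. It is not MOSS's actual implementation; the function name `normalize`, the `ID0`, `ID1`, ... placeholders, and the tiny keyword list are all assumptions made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of any real tool): split C-like source into
// tokens, skip whitespace and comments, and replace each distinct identifier
// with ID0, ID1, ... in order of first appearance. Keywords are kept as-is.
std::vector<std::string> normalize(const std::string& src) {
    // Deliberately incomplete keyword list, enough for the examples here.
    static const std::set<std::string> keywords = {
        "int", "for", "while", "if", "else", "return"};
    std::map<std::string, std::string> names;  // identifier -> placeholder
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        // Line comment: skip to end of line.
        if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') ++i;
            continue;
        }
        // Block comment: skip to the closing "*/" (or end of input).
        if (c == '/' && i + 1 < src.size() && src[i + 1] == '*') {
            std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? src.size() : end + 2;
            continue;
        }
        // Identifier or keyword.
        if (std::isalpha(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            std::string word = src.substr(start, i - start);
            if (keywords.count(word)) {
                out.push_back(word);
            } else {
                auto it = names.find(word);
                if (it == names.end())
                    it = names.emplace(word, "ID" + std::to_string(names.size())).first;
                out.push_back(it->second);
            }
            continue;
        }
        // Integer literal (kept verbatim).
        if (std::isdigit(c)) {
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back(src.substr(start, i - start));
            continue;
        }
        // Anything else is treated as a one-character operator or punctuator.
        out.push_back(std::string(1, src[i]));
        ++i;
    }
    return out;
}
```

With this sketch, `normalize("int x = 10; y = x + 5;")` and `normalize("int a = 10; y = a + 5;")` return the same token sequence, so the renamed snippets compare equal. The for/while rewrite from the question would not be caught by this step alone, since the token order differs.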

Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf

If you google "measure software similarity", you should find a few more useful hits: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html

Throwback1986 answered Oct 05 '22 08:10


In computer science terminology, your problem may be stated as source code plagiarism detection. A good start would be to read this article on Dr. Dobb's: Detecting Source-Code Plagiarism. It lists algorithms for detecting plagiarism in source code.
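One common family of approaches compares overlapping runs of tokens (k-grams) between two files. The following is only a rough sketch of that idea, not any specific tool's algorithm and not necessarily what the article describes; the names `kgrams` and `jaccard` are made up here, and the input is assumed to be an already tokenized file.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collect every run of k consecutive tokens (a "k-gram") as one joined string.
// The unit-separator character keeps token boundaries unambiguous.
std::set<std::string> kgrams(const std::vector<std::string>& tokens, std::size_t k) {
    std::set<std::string> grams;
    if (k == 0 || tokens.size() < k) return grams;
    for (std::size_t i = 0; i + k <= tokens.size(); ++i) {
        std::string g;
        for (std::size_t j = 0; j < k; ++j) {
            g += tokens[i + j];
            g += '\x1f';
        }
        grams.insert(g);
    }
    return grams;
}

// Jaccard similarity of two k-gram sets: size of the intersection divided by
// size of the union, a value between 0 and 1.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) return 1.0;  // two empty inputs count as equal
    std::size_t common = 0;
    for (const auto& g : a) common += b.count(g);
    return static_cast<double>(common) / (a.size() + b.size() - common);
}
```

A score near 1 means the two token streams share most of their k-grams. Since the question asks for more than a single value, a real tool would also keep track of where each shared k-gram occurs so that matching regions can be reported; this sketch leaves that out.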

Note: What you have asked for is indeed a tough computing problem :)

Yavar answered Oct 05 '22 08:10