First, please note that I am interested in how something like this would work; I'm not intending to build it for a client or anything, as there may well already be open source implementations.
How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out known words like 'the', 'a', etc., and then see how many words are the same in different essays? Do they then have a magic number of identical words which flags a possible duplicate? Do they use levenshtein()?
My language of choice is PHP.
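To make that concrete, here is the sort of naive approach I'm imagining; just a sketch, with a made-up stop-word list and an arbitrary 50% threshold:

// Naive idea: strip stop words, then compare the remaining word sets
// of two essays (Jaccard overlap).
function wordSet($text) {
    $stopWords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is');
    preg_match_all('/[a-z\']+/', strtolower($text), $matches);
    return array_unique(array_diff($matches[0], $stopWords));
}

function similarity($a, $b) {
    $setA = wordSet($a);
    $setB = wordSet($b);
    $common = count(array_intersect($setA, $setB));
    $total  = count(array_unique(array_merge($setA, $setB)));
    return $total > 0 ? $common / $total : 0.0;
}

$essay1 = 'The quick brown fox jumps over the lazy dog.';
$essay2 = 'A quick brown fox jumped over a lazy dog.';

// Magic number: flag anything over 50% word overlap.
if (similarity($essay1, $essay2) > 0.5) {
    echo "Possible duplicate\n";
}

Is that roughly the idea, or do real checkers do something smarter?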
UPDATE
I'm not thinking of checking for plagiarism globally, but rather within, say, 30 uploaded essays from one class, in case students have collaborated on a strictly one-person assignment.
Here is an online site that claims to do so: http://www.plagiarism.org/
For detecting plagiarism in code, the most popular tool is the MOSS system. Using MOSS involves packaging up students' solutions, submitting them for automated examination, and reviewing the results.
Source code plagiarism, also known as programming plagiarism, is, simply put, copying or adapting another person's source code and claiming it as your own without attribution.
A plagiarism checker uses database software to scan for matches between your text and existing texts. Universities use these tools to scan student assignments, and there are also commercial plagiarism checkers you can use to check your own work before submitting.
Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approximate it by simply compressing the text. The distance is defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s.
A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code that uses zlib:
PHP:
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test'));
print(ncd('this is a test', 'this text is completely different'));
Python:
>>> from zlib import compress as c
>>> def ncd(x, y):
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd(b'this is a test', b'this was a test')
0.30434782608695654
>>> ncd(b'this is a test', b'this text is completely different')
0.74358974358974361

(Note the byte strings: in Python 3, zlib.compress() expects bytes, not str.)
Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
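Since you mention comparing around 30 essays from one class, a natural way to use this is to run ncd() over every pair and flag the lowest scores. A rough sketch (the directory layout and the 0.7 threshold are made up; you'd have to tune the threshold on real data):

// Pairwise comparison of all essays in a class, using ncd() from above.
$essays = array();
foreach (glob('essays/*.txt') as $path) {   // made-up directory layout
    $essays[basename($path)] = file_get_contents($path);
}

$names = array_keys($essays);
$n = count($names);
for ($i = 0; $i < $n; $i++) {
    for ($j = $i + 1; $j < $n; $j++) {
        $score = ncd($essays[$names[$i]], $essays[$names[$j]]);
        if ($score < 0.7) {   // smaller NCD = more similar
            printf("%s and %s look similar (NCD %.2f)\n",
                   $names[$i], $names[$j], $score);
        }
    }
}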
I think this problem is complicated and doesn't have one best solution. You can detect exact duplication of words at the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy: the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly in order to throw off software like this.
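A minimal sketch of that trivial checksum approach in PHP, normalising case and whitespace first so trivial edits don't break the match (file paths here are made up):

// Whole-document check: normalise, hash, and look the hash up
// in a list of previously seen documents.
function documentHash($text) {
    $normalised = preg_replace('/\s+/', ' ', strtolower(trim($text)));
    return sha1($normalised);
}

$submission  = file_get_contents('essays/new.txt');              // made-up path
$knownHashes = file('known_hashes.txt', FILE_IGNORE_NEW_LINES);  // one sha1 per line
if (in_array(documentHash($submission), $knownHashes, true)) {
    echo "Exact duplicate of a known document\n";
}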
To get something that works at the phrase level you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in style of writing between paragraphs, and focus your attention on paragraphs that feel "out of place" compared to the rest of a paper.
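One common phrase-level building block (a standard technique, not specific to any particular product) is comparing overlapping word n-grams, also called shingles. A rough PHP sketch:

// Break a text into overlapping 5-word "shingles" and report how
// many shingles two documents share.
function shingles($text, $n = 5) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    $words = $matches[0];
    $result = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $result[] = implode(' ', array_slice($words, $i, $n));
    }
    return array_unique($result);
}

function sharedShingles($a, $b) {
    return count(array_intersect(shingles($a), shingles($b)));
}

$essayA = 'It was the best of times, it was the worst of times.';
$essayB = 'He said it was the best of times, full stop.';

// Any shared 5-word phrase is worth a human look.
if (sharedShingles($essayA, $essayB) > 0) {
    echo "Shared phrases found; review manually\n";
}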
There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.
http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf
http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf