How would you code an anti-plagiarism site?

First, please note that I am interested in how something like this would work and am not intending to build it for a client, as I'm sure there are already open source implementations.

How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out common words like 'the', 'a', etc. and then see how many words are the same in different essays? Do they then have a magic number of identical words that flags an essay as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.

UPDATE

I'm thinking of not checking for plagiarism globally, but rather within, say, 30 uploaded essays from a class, in case students have gotten together on a strictly one-person assignment.

Here is an online site that claims to do so: http://www.plagiarism.org/

asked Jul 05 '09 23:07

alex

2 Answers

Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).

However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot calculate a text's Kolmogorov complexity exactly, but you can approximate it by simply compressing the text.

A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:

PHP:

<?php
// Normalized Compression Distance, approximated with zlib compression.
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

// Print each score on its own line.
print(ncd('this is a test', 'this was a test') . "\n");
print(ncd('this is a test', 'this text is completely different') . "\n");

Python:

>>> from zlib import compress as c
>>> def ncd(x, y):
...     x, y = x.encode(), y.encode()  # zlib needs bytes in Python 3
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361

Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
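For the "30 essays from one class" case in the question, the same ncd idea can be applied to every pair of essays and the pairs sorted by score. What follows is only a rough sketch under my own assumptions: rank_pairs and the toy essays dict are made-up names for illustration, and real essays would be read from the uploaded files.

```python
import zlib
from itertools import combinations


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, approximated with zlib."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_pairs(essays: dict[str, str]) -> list[tuple[float, str, str]]:
    """Score every pair of essays; most similar (lowest NCD) comes first."""
    encoded = {name: text.encode("utf-8") for name, text in essays.items()}
    scores = [
        (ncd(encoded[a], encoded[b]), a, b)
        for a, b in combinations(sorted(encoded), 2)
    ]
    return sorted(scores)


# Hypothetical toy data standing in for uploaded essays.
essays = {
    "alice": "The quick brown fox jumps over the lazy dog near the river bank. " * 3,
    "bob": "The quick brown fox leaps over the lazy dog near the river bank. " * 3,
    "carol": "Compression based similarity needs no knowledge of the language. " * 3,
}

for score, a, b in rank_pairs(essays):
    print(f"{a} vs {b}: {score:.3f}")
```

With 30 essays that is only 435 pairs, so brute force is fine. One caveat: zlib's deflate only looks back 32 KB, so for long essays a compressor with a larger window (bz2 or lzma, for instance) may give more useful scores. Any cut-off below which a pair counts as "suspicious" would have to be tuned on real data.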

answered Sep 21 '22 12:09

Stephan202

I think this problem is complicated and doesn't have one best solution. You can detect exact duplication of words anywhere from the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy: the most trivial solution would take the checksum of each submitted document and compare it against a list of checksums of known documents. Beyond that, you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly to throw off software like this.
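The checksum idea could look roughly like the sketch below. The names fingerprint, known and find_exact_copy are mine, the stored label is made up, and normalizing case and whitespace before hashing is an extra assumption on top of the plain checksum described above.

```python
import hashlib
from typing import Optional


def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical store of previously seen documents: fingerprint -> label.
known = {fingerprint("An essay that was submitted last year."): "essay-17"}


def find_exact_copy(text: str) -> Optional[str]:
    """Return the label of an identical known document, or None."""
    return known.get(fingerprint(text))


print(find_exact_copy("An  essay that was\nsubmitted LAST year."))  # essay-17
print(find_exact_copy("An essay that was submitted this year."))    # None
```

As the second call shows, changing a single word defeats this check completely, which is why it only covers the whole-document case.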

To get something that works at the phrase level, you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in writing style between paragraphs and focus your attention on paragraphs that feel "out of place" compared to the rest of the paper.
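Style analysis is hard to show in a few lines, so here is a different and simpler phrase-level technique instead: comparing overlapping runs of words ("shingles") and measuring how many of them two texts share (Jaccard similarity). This is my own illustration and not something the answer above prescribes. The function names, the default of 3 words per shingle and the sample sentences are all assumptions.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """All runs of n consecutive words, lowercased, punctuation dropped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Fraction of distinct word n-grams shared by both texts (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0  # too short to form a single shingle
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample sentences.
original = "The industrial revolution changed how people lived and worked in cities."
tweaked = "The industrial revolution really changed how people lived and worked in cities."
unrelated = "Photosynthesis converts light into chemical energy inside plant cells."

print(jaccard(original, tweaked))    # high: one inserted word breaks few shingles
print(jaccard(original, unrelated))  # no shared three-word runs
```

Unlike a checksum, a small edit only destroys the few shingles that span it, so lightly reworded sentences still score high. Heavier paraphrasing or synonym swapping will still slip through.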

There are lots of papers on this subject, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.

http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf

http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf

answered Sep 20 '22 12:09

Peter Recore
