
How to develop a Plagiarism detector?

Tags:

projects

I am planning to build a plagiarism detector as my Computer Science Engineering final-year project, and I would like your suggestions on how to go about it.

I would appreciate it if you could suggest which fields of CS I need to focus on, and which language would be most appropriate for the implementation.

deovrat singh asked Jul 28 '09 11:07



2 Answers

The language is nearly irrelevant. Another question exists that discusses this in more detail. Basically, the method suggested there is to use Google: extract parts of the target text and search for them on Google.
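As a rough illustration of the "extract parts and search" idea, here is a minimal Python sketch using only the standard library. The function names (`extract_phrases`, `to_search_url`), the window size, and the choice to wrap each phrase in double quotes for an exact-phrase search are my own assumptions, not something taken from the linked question. It only builds query URLs; actually fetching results (and respecting the search provider's terms) is left out.

```python
import re
from urllib.parse import quote_plus


def extract_phrases(text, words_per_phrase=8, step=8):
    """Split text into consecutive word windows to use as search phrases.

    words_per_phrase and step are illustrative defaults (assumptions).
    """
    words = re.findall(r"\w+(?:'\w+)?", text)
    phrases = []
    for start in range(0, len(words), step):
        window = words[start:start + words_per_phrase]
        # Skip very short trailing fragments: they match too many pages.
        if len(window) >= words_per_phrase // 2:
            phrases.append(" ".join(window))
    return phrases


def to_search_url(phrase):
    """Build a query URL; the double quotes ask for an exact-phrase match."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten eleven twelve"
    for phrase in extract_phrases(sample, words_per_phrase=4, step=4):
        print(to_search_url(phrase))
```

A phrase that returns an exact hit is a candidate for copied text; longer windows give fewer false positives but miss lightly edited passages.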

Sampson answered Sep 28 '22 01:09


I am building a plagiarism checker in Python as a hobby project. These are the steps I follow:

  1. Tokenize the document.

  2. Remove all stop words using the NLTK library.

  3. Use the Gensim library to find the most relevant words, line by line. This can be done by fitting an LDA or LSA model to the document.

  4. Use the Google Search API to search for those words.
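To make the flow of steps 1 to 3 concrete, here is a simplified, dependency-free sketch. It is not the NLTK/Gensim version: the small `STOP_WORDS` set stands in for NLTK's stop-word corpus, and a plain document-wide term-frequency ranking stands in for the LDA/LSA model. Both are assumptions made so the example runs on its own, and the function names are hypothetical. The resulting strings are what you would hand to the search API in step 4.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK's stopwords
# corpus is far more complete.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def tokenize(line):
    """Step 1: lowercase the line and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", line.lower())


def content_tokens(line):
    """Step 2: drop stop words."""
    return [t for t in tokenize(line) if t not in STOP_WORDS]


def key_terms(line, doc_counts, top_n=3):
    """Step 3 (stand-in): rank a line's terms by document-wide frequency."""
    unique = list(dict.fromkeys(content_tokens(line)))  # keep first-seen order
    unique.sort(key=lambda t: -doc_counts[t])  # stable: ties keep line order
    return unique[:top_n]


def build_queries(document, top_n=3):
    """Return one search query string per non-empty line of the document."""
    lines = [ln for ln in document.splitlines() if ln.strip()]
    doc_counts = Counter(t for ln in lines for t in content_tokens(ln))
    queries = []
    for ln in lines:
        terms = key_terms(ln, doc_counts, top_n)
        if terms:
            queries.append(" ".join(terms))
    return queries


if __name__ == "__main__":
    doc = "The cat sat on the mat.\nA cat is an animal.\n\nDogs chase the cat."
    print(build_queries(doc, top_n=2))
```

Swapping the frequency ranking for a real topic model should only change `key_terms`; the rest of the pipeline stays the same.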

Note: you could instead pass the whole document to the Google API in a single search. That works when you are dealing with a small amount of data. However, when building a plagiarism checker for whole sites and web-scraped data, you will need the NLTK preprocessing described above.

The Google Search API then returns the top articles containing the words that the Gensim LDA or LSA model produced.

Hope this helps.

Sumukh Bhandarkar answered Sep 28 '22 00:09