I need to write a module to detect similar documents. I have read many papers on document-fingerprinting techniques and related methods, but I don't know how to write the code or implement such a solution. The algorithm should work for Chinese, Japanese, English and German, or be language independent. How can I accomplish this?
Document similarity program: our algorithm to confirm document similarity consists of three fundamental steps:
1. Split the documents into words.
2. Compute the word frequencies.
3. Calculate the dot product of the document vectors.
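A minimal sketch of those three steps, assuming a naive whitespace tokenizer (fine for English and German; Chinese and Japanese would need a real word segmenter) and normalising the dot product to cosine similarity so document length does not dominate:

import math
from collections import Counter

def similarity(doc_a, doc_b):
    """Cosine of the two word-frequency vectors (1.0 = identical word mix)."""
    # Steps 1 + 2: split into words and count frequencies
    freq_a = Counter(doc_a.lower().split())
    freq_b = Counter(doc_b.lower().split())
    # Step 3: dot product, normalised by the vector lengths
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(similarity("the cat sat on the mat", "the cat sat on the hat"))  # 0.875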
The simplest way to compute the similarity between two documents using word embeddings is to compute the document centroid vector. This is the vector that's the average of all the word vectors in the document.
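As a sketch, assuming you already have a word-to-vector lookup table (the embeddings dict below is a toy stand-in for real pretrained vectors such as word2vec or fastText):

import numpy as np

def centroid(doc, embeddings):
    """Average of the word vectors; words missing from the table are skipped."""
    vectors = [embeddings[w] for w in doc.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else None

def centroid_similarity(doc_a, doc_b, embeddings):
    """Cosine similarity between the two document centroids."""
    a, b = centroid(doc_a, embeddings), centroid(doc_b, embeddings)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; a real table would have hundreds of dimensions
embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
print(centroid_similarity("cat dog", "dog cat", embeddings))  # 1.0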
The IDF is the logarithm of the total number of documents divided by the number of documents that contain the term. For example: if there are 50,000 documents and the word 'stock' appears in 500 of them, the IDF is log(50000/500) = 4.6 (natural log). Assuming a term frequency (TF) of 0.01, i.e. 'stock' makes up 1% of the document's words, the TF-IDF of 'stock' is 4.6 * 0.01 = 0.046.
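That arithmetic as a sketch (using the natural logarithm, which is what makes log(50000/500) come out to 4.6):

import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    tf = term_count / doc_length               # 0.01 if 'stock' is 1% of the words
    idf = math.log(num_docs / docs_with_term)  # natural log, as in the example above
    return tf * idf

# 'stock': 1% of a 100-word document, found in 500 of 50,000 documents
print(tf_idf(term_count=1, doc_length=100, num_docs=50000, docs_with_term=500))  # ~0.046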
Bayesian filters have exactly this purpose. That's the technique you'll find in most tools that identify spam.
For example, to detect a language (from http://sebsauvage.net/python/snyppets/#bayesian):
from reverend.thomas import Bayes

guesser = Bayes()
# Train the classifier on a few labelled samples per language
guesser.train('french', 'La souris est rentrée dans son trou.')
guesser.train('english', 'my tailor is rich.')
guesser.train('french', 'Je ne sais pas si je viendrai demain.')
guesser.train('english', 'I do not plan to update my website soon.')

print(guesser.guess('Jumping out of cliffs it not a good idea.'))
# [('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
print(guesser.guess('Demain il fera très probablement chaud.'))
# [('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
But it works to detect any category you train it for: technical text, songs, jokes, etc., as long as you can provide enough material to let the tool learn what your documents look like.
If these are pure text documents, or you have a method to extract the text from the documents, you can use a technique called shingling.
You first compute a hash of each whole document. If the hashes are the same, the documents are (almost certainly) identical and you are done.
If not, you break each document down into smaller chunks. These are your 'shingles.'
Once you have the shingles, you can compute a hash for each shingle and compare the two sets of shingle hashes to determine how similar the documents actually are.
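Here's a rough sketch of both steps, assuming word-level shingles of four words (the shingle size is an arbitrary choice) and Jaccard overlap of the hash sets as the similarity score:

import hashlib

def doc_hash(text):
    """Step one: a single hash of the whole document catches exact duplicates."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, k=4):
    """Hashes of every k-word shingle in the text."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 = identical, 0.0 = nothing in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose is an onion")
print(jaccard(s1, s2))  # 0.6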
The other technique you can use is to generate n-grams of the entire documents, count the n-grams shared by each pair of documents, and produce a weighted similarity score. An n-gram splits text into smaller overlapping chunks: padded with spaces, 'apple' becomes ' ap', 'app', 'ppl', 'ple', 'le ' (these are character 3-grams, or trigrams). This approach can become quite computationally expensive over a large number of documents or over two very large documents. Of course, common n-grams such as 'the', ' th', 'th ', etc. need to be weighted to score lower.
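A sketch of that weighting, using character trigrams (which have the nice side effect of needing no word segmentation, so they also work for Chinese and Japanese) and an IDF-style weight so n-grams that appear in every document count for nothing:

import math

def char_ngrams(text, n=3):
    """Overlapping character n-grams, padded so word edges form n-grams too."""
    text = f" {text.lower()} "
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(doc_a, doc_b, corpus):
    """Weighted n-gram overlap between two documents."""
    corpus_grams = [char_ngrams(d) for d in corpus]

    def weight(gram):
        # IDF-style: n-grams found in many documents ('the', ' th') score low
        df = sum(1 for grams in corpus_grams if gram in grams)
        return math.log(len(corpus) / df) if df else 0.0

    a, b = char_ngrams(doc_a), char_ngrams(doc_b)
    total = sum(weight(g) for g in a | b)
    return sum(weight(g) for g in a & b) / total if total else 0.0

corpus = ["the cat sat on the mat", "the dog sat on the mat", "unrelated text entirely"]
print(ngram_similarity(corpus[0], corpus[1], corpus))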
I've posted about this on my blog, and there are some links in the post to a few other articles on the subject: Shingling - it's not just for roofers.
Best of luck!