
Checking for document duplicates and similar documents in a document management application

Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.

I am involved in writing a custom document management application in PHP on a Linux box. It will store files in various formats (potentially thousands of them), and we need to be able to check whether a text document has been uploaded before, to prevent duplication in the database.

Essentially when a user uploads a new file we would like to be able to present them with a list of files that are either duplicates or contain similar content. This would then allow them to choose one of the pre-existing documents or continue uploading their own.

Similar documents would be determined by looking through their content for similar sentences and perhaps a dynamically generated list of keywords. We can then display a percentage match to the user to help them find the duplicates.
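To make the "percentage match" idea concrete, here is a minimal sketch in Python (chosen for brevity; the application itself is PHP). It is only one possible approach, not something the application is known to use: it breaks each document into overlapping three-word shingles and reports the Jaccard overlap of the two sets as a percentage. The function names `shingles` and `percent_match` and the shingle size of 3 are assumptions made for illustration.

```python
from __future__ import annotations

import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def percent_match(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, as a percentage."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 100.0  # two empty documents are treated as identical
    return 100.0 * len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(percent_match("a b c d", "a b c e"))
```

Comparing every upload against every stored document this way grows linearly with the number of stored files, so for thousands of documents the shingle sets would probably need to be precomputed and stored.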

Can you recommend any packages for this process, or share how you have done this in the past?

I think direct duplicates can be detected by extracting all the text content and

  • Stripping whitespace
  • Removing punctuation
  • Converting to lower or upper case

then forming an MD5 hash to compare against the hash of any new document. Stripping those items out should stop a duplicate from being missed when the user has only edited a document to add extra paragraph breaks, for example. Any thoughts?
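The normalise-then-hash steps above can be sketched as follows. This is an illustrative Python version, not the application's code; the function name `content_fingerprint` is hypothetical, and treating underscores as punctuation is my own assumption. The same steps should map onto PHP's `preg_replace`, `strtolower` and `md5`.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """MD5 of the text with whitespace and punctuation removed, lower-cased."""
    # [\W_]+ matches runs of anything that is not a letter or digit,
    # which covers both whitespace and punctuation (assumption: underscores
    # count as punctuation too).
    normalised = re.sub(r"[\W_]+", "", text).lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    original = "Hello, World!"
    reflowed = "hello\n\n  world"
    print(content_fingerprint(original) == content_fingerprint(reflowed))
```

Storing this fingerprint in an indexed column would make the exact-duplicate check a single lookup at upload time. It only catches documents whose letters and digits are identical, so a one-word change produces a completely different hash.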

If the computational requirement is too great to run in real time, this process could instead run as a nightly job, and we could notify users of any duplicates when they next log in. Real time would be preferred, however.

asked Nov 13 '09 12:11 by Treffynnon


1 Answer


I have found a program that does what its creator, Jesse Kornblum, calls "fuzzy hashing". In short, it computes hashes of files that can be used to detect similar files as well as identical matches.

The theory behind it is documented here: Identifying almost identical files using context triggered piecewise hashing
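To give a feel for the idea, here is a toy Python sketch of context triggered piecewise hashing. It is not ssdeep's algorithm or its scoring: the window checksum, the fixed `block_size`, the FNV-1a piece hash and the `difflib` comparison are all simplifications I have assumed for illustration, and the names `toy_ctph` and `similarity` are hypothetical. The point it shows is that piece boundaries depend only on the last few bytes of content, so an edit in one place changes only nearby signature characters.

```python
from difflib import SequenceMatcher

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7  # number of trailing bytes that decide where a piece ends


def toy_ctph(data: bytes, block_size: int = 16) -> str:
    """Emit one signature character per content-defined piece of the input."""
    signature = []
    piece_hash = 0x811C9DC5  # FNV-1a 32-bit offset basis
    pending = False  # True while the current piece holds unhashed-out bytes
    for i, byte in enumerate(data):
        # FNV-1a step: xor the byte in, then multiply by the FNV prime.
        piece_hash = ((piece_hash ^ byte) * 0x01000193) & 0xFFFFFFFF
        pending = True
        # Checksum of the last WINDOW bytes, recomputed each step for
        # clarity; a real rolling hash would update it incrementally.
        window_sum = sum(data[max(0, i - WINDOW + 1): i + 1])
        if window_sum % block_size == block_size - 1:
            # Trigger point: close the piece and start a new one.
            signature.append(ALPHABET[piece_hash % 64])
            piece_hash = 0x811C9DC5
            pending = False
    if pending:
        signature.append(ALPHABET[piece_hash % 64])  # trailing piece
    return "".join(signature)


def similarity(sig_a: str, sig_b: str) -> float:
    """Rough percentage match between two signatures."""
    return 100.0 * SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()


if __name__ == "__main__":
    body = bytes(range(256)) * 4
    a = toy_ctph(b"first header" + body)
    b = toy_ctph(b"a completely different and longer header" + body)
    print(similarity(a, b))
```

Because the two inputs in the usage example share the same body, their signatures end in the same run of characters even though the headers differ, which is what lets a comparison report them as similar rather than unrelated.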

ssdeep is the name of the program, and it runs on Windows or Linux. It was intended for use in forensic computing, but it seems well suited to our purposes. In a short test on an old Pentium 4 machine, it took about 3 seconds to go through a 23 MB hash file (hashes for just under 135,000 files) looking for matches against two files. That time includes creating the hashes for the two files I was searching against.

answered Sep 30 '22 06:09 by Treffynnon