Detecting duplicate files

I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of them should be preserved and the remaining duplicates deleted to save disk space.

By duplicates I mean files that have the same content but may differ in file name and path.

I was thinking about using hash algorithms for this purpose, but there is a chance that two different files end up with the same hash. So I need some additional mechanism to tell me the files aren't really the same even though the hashes match, because I don't want to delete two different files.

Which additional fast and reliable mechanism would you use?
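(For hash matches, a final byte-for-byte comparison could serve as that extra mechanism; Python's standard `filecmp` module already does this. A minimal sketch, with an illustrative function name:)

```python
import filecmp

def truly_identical(path_a, path_b):
    """Confirm a hash match by comparing raw bytes.

    With shallow=False, filecmp.cmp reads both files and compares
    their contents chunk by chunk, so a hash collision between two
    different files cannot slip through.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```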

asked Mar 21 '12 by xralf

1 Answer

Calculating a hash for every file will make your program slow. It's better to check file sizes first: all duplicates of a file must have the same size, so apply the hash check only to files that share a size. This will make your program much faster.

You can break the check into more steps:

  1. Check whether the file sizes are equal.
  2. If step 1 passes, check whether the first and last range of bytes (say, 100 bytes) are equal.
  3. If step 2 passes, check the file type.
  4. If step 3 passes, compute and compare the hashes.

The more cheap criteria you apply first, the faster it will perform, because most non-duplicates are ruled out before you ever reach the expensive last resort (the hash). A sketch of this staged filtering follows below.
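Here is a minimal sketch of that pipeline in Python. The 100-byte window, the choice of SHA-256, and the function names are my own assumptions, not part of the answer; the file-type step is folded into the head/tail comparison, since magic numbers sit in a file's first bytes anyway.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 100  # bytes compared at each end of the file (assumed value)

def fingerprint(path):
    """First and last CHUNK bytes -- a cheap pre-filter."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(CHUNK)
        f.seek(max(size - CHUNK, 0))
        tail = f.read(CHUNK)
    return head, tail

def sha256(path):
    """Full-content hash, computed only for surviving candidates."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    """Yield lists of paths believed to be duplicates of each other."""
    # Step 1: group by file size; a unique size cannot be a duplicate.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: group the same-size files by their first/last bytes.
        by_ends = defaultdict(list)
        for path in paths:
            by_ends[fingerprint(path)].append(path)
        for candidates in by_ends.values():
            if len(candidates) < 2:
                continue
            # Steps 3-4: hash only what survived the cheap checks.
            by_hash = defaultdict(list)
            for path in candidates:
                by_hash[sha256(path)].append(path)
            for group in by_hash.values():
                if len(group) > 1:
                    yield group
```

Each stage only runs on the files that survived the previous one, so on a typical tree the expensive full-file hash is computed for a small fraction of the files.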

answered Sep 22 '22 by Shiplu Mokaddim