Detecting duplicate files

I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of them should be preserved and the remaining duplicates deleted to save disk space.

By duplicates I mean files that have the same content but may differ in file name and path.

I was thinking about using hash algorithms for this purpose, but there is a chance that two different files end up with the same hash. So I need some additional mechanism to tell me the files aren't really the same even though the hashes match, because I don't want to delete two different files.

Which additional fast and reliable mechanism would you use?
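(For hash matches, a final byte-for-byte comparison could serve as that extra mechanism; Python's standard `filecmp` module already does this. A minimal sketch, with an illustrative function name:)

```python
import filecmp

def truly_identical(path_a, path_b):
    """Confirm a hash match by comparing raw bytes.

    With shallow=False, filecmp.cmp reads both files and compares
    their contents chunk by chunk, so a hash collision between two
    different files cannot slip through.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```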

asked Mar 21 '12 by xralf

1 Answer

Calculating a hash for every file will make your program slow. It's better to check file sizes first: all duplicates of a file must have the same size, so apply the hash check only to files that share a size. This will make your program much faster.

You can break the check into more steps:

  1. Check whether the file sizes are equal.
  2. If step 1 passes, check whether the first and last range of bytes (say, 100 bytes) are equal.
  3. If step 2 passes, check the file type.
  4. If step 3 passes, compute and compare the hashes.

The more cheap criteria you apply first, the faster it will perform, because most non-duplicates are ruled out before you ever reach the expensive last resort (the hash). A sketch of this staged filtering follows below.
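Here is a minimal sketch of that pipeline in Python. The 100-byte window, the choice of SHA-256, and the function names are my own assumptions, not part of the answer; the file-type step is folded into the head/tail comparison, since magic numbers sit in a file's first bytes anyway.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 100  # bytes compared at each end of the file (assumed value)

def fingerprint(path):
    """First and last CHUNK bytes -- a cheap pre-filter."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(CHUNK)
        f.seek(max(size - CHUNK, 0))
        tail = f.read(CHUNK)
    return head, tail

def sha256(path):
    """Full-content hash, computed only for surviving candidates."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    """Yield lists of paths believed to be duplicates of each other."""
    # Step 1: group by file size; a unique size cannot be a duplicate.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: group the same-size files by their first/last bytes.
        by_ends = defaultdict(list)
        for path in paths:
            by_ends[fingerprint(path)].append(path)
        for candidates in by_ends.values():
            if len(candidates) < 2:
                continue
            # Steps 3-4: hash only what survived the cheap checks.
            by_hash = defaultdict(list)
            for path in candidates:
                by_hash[sha256(path)].append(path)
            for group in by_hash.values():
                if len(group) > 1:
                    yield group
```

Each stage only runs on the files that survived the previous one, so on a typical tree the expensive full-file hash is computed for a small fraction of the files.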

answered Sep 22 '22 by Shiplu Mokaddim