
Determining whether a file is a duplicate

Is there a reliable way to determine whether or not two files are the same? For example, two files with the same size and type may or may not be the same binarilly (yeah, I know it's not really a word). I assume that comparing one or two checksums of the files will help, but I wonder:

  1. How reliable are checksums at determining whether two files are different; what are the chances of two different files having the same checksum?
  2. Would reliability increase by applying additional checksum comparisons?
  3. Which checksum algorithm(s) would be the most efficient and/or reliable?

Any ideas, suggestions or thoughts are appreciated!

P.S. The code for this is being written in Java running on a *nix system, but generic or platform-agnostic input is most helpful.

Todd R asked May 11 '10 17:05





2 Answers

It's impossible to know with certainty whether or not two files are the same unless you compare them byte for byte. It's similar to how you can't guarantee that a collection does or doesn't contain a given object unless you check every item in the collection.
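For reference, a byte-for-byte comparison might look something like the sketch below. The class and method names (`ByteCompare`, `sameContent`) are made up for illustration and are not from the question. It checks the file sizes first, since files of different length can never match, and only then reads both streams.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, named for illustration only.
final class ByteCompare {

    /** Returns true only if both files have the same length and the same bytes. */
    static boolean sameContent(Path a, Path b) throws IOException {
        // Files of different sizes cannot be identical, so skip reading them.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                // Stop at the first differing byte (or if b ends early).
                if (byteA != inB.read()) {
                    return false;
                }
            }
            // a is exhausted; b must be exhausted too.
            return inB.read() == -1;
        }
    }
}
```

On Java 12 or newer, `Files.mismatch(a, b) == -1L` should give the same answer without hand-written loops.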

Checksums are basically a hash. Whether they're good enough for your purposes depends on how mission-critical your app is. It's certainly possible to create a hash function with low risk of collision; after all, passwords are hashed, even in situations where they protect sensitive data and you wouldn't want to have a second valid password on your account. Unless you're writing code for, say, a bank, a strong checksum algorithm should provide a very good approximation.
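If a hash comparison is good enough for your case, a minimal Java sketch could look like the following. The names `FileDigest`, `digest` and `probablySame` are hypothetical, and the algorithm is passed in as a string so you can try "MD5", "SHA-1" or "SHA-256" (all three are standard `MessageDigest` names). Matching digests only mean the files are very probably identical, not certainly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, named for illustration only.
final class FileDigest {

    /** Streams the file through the named digest algorithm and returns the raw hash. */
    static byte[] digest(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            // Feed the file in chunks so large files are never fully loaded into memory.
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        return md.digest();
    }

    /** True if sizes and digests match; a collision is unlikely but still possible. */
    static boolean probablySame(Path a, Path b, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        return MessageDigest.isEqual(digest(a, algorithm), digest(b, algorithm));
    }
}
```

Hashing pays off mostly when you compare one file against many others, because each digest can be computed once and stored. For a single pair of files, reading both to hash them costs about as much as comparing them directly.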

Using multiple checksums will increase reliability if and only if the different checksum algorithms use dissimilar hash functions.

Your third question has already been taken care of by leonbloy's answer; MD5 and SHA-1 are common.

Pops answered Sep 22 '22 02:09



1) Very reliable
2) Not theoretically
3) SHA-1
zaf answered Sep 18 '22 02:09
