
How to find all files with the same content?

This is an interview question: "Given a directory with lots of files, find the files that have the same content." I would propose using a hash function to compute a hash of each file's contents and then comparing only the files whose hashes match. Does that make sense?

The next question is how to choose the hash function. Would you use SHA-1 for that purpose?
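For illustration, a minimal sketch of that idea in Python, hashing each file's contents with SHA-1 and grouping paths by digest (the function names are just for this example):

    import hashlib
    import os
    from collections import defaultdict

    def sha1_of(path, chunk_size=1 << 20):
        """Return the SHA-1 digest of a file, read in chunks to bound memory use."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(directory):
        """Group files by content hash; any group with 2+ paths holds identical files."""
        groups = defaultdict(list)
        for entry in os.scandir(directory):
            if entry.is_file():
                groups[sha1_of(entry.path)].append(entry.path)
        return [paths for paths in groups.values() if len(paths) > 1]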

asked by Michael

2 Answers

I'd rather use the hash as a second step. Sorting the directory entries by file size first, and hashing and comparing only when two or more files share a size, can shrink your search space considerably in the general case.
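A minimal sketch of that two-step approach in Python (group by size with a cheap stat() call, then hash only files whose size collides; the names here are illustrative):

    import hashlib
    import os
    from collections import defaultdict

    def duplicates_by_size_then_hash(directory, chunk_size=1 << 20):
        # Step 1: group files by size -- one cheap stat() per file, no reads yet.
        by_size = defaultdict(list)
        for entry in os.scandir(directory):
            if entry.is_file():
                by_size[entry.stat().st_size].append(entry.path)

        # Step 2: hash only files whose size matches at least one other file.
        duplicates = []
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            by_hash = defaultdict(list)
            for path in paths:
                h = hashlib.sha1()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(chunk_size), b""):
                        h.update(chunk)
                by_hash[h.hexdigest()].append(path)
            duplicates.extend(g for g in by_hash.values() if len(g) > 1)
        return duplicates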

answered by Dr. belisarius

Like most interview questions, it's more meant to spark a conversation than to have a single answer.

If there are very few files, it may be faster to simply do a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won't have to seek around the disk reading chunks from multiple files. This process can be sped up by grabbing increasingly large chunks of each file as you progress through the files, eliminating candidates as you go. It may also be necessary to distribute the problem among multiple servers if there are enough files.
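As a rough sketch of the byte-by-byte route for a pair of candidates (assuming they have already been matched on size), a comparison that stops at the first differing chunk might look like this; the standard library's filecmp.cmp(a, b, shallow=False) does much the same job:

    def files_identical(path_a, path_b, chunk_size=1 << 16):
        """Compare two files chunk by chunk, stopping at the first mismatch."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                chunk_a = fa.read(chunk_size)
                chunk_b = fb.read(chunk_size)
                if chunk_a != chunk_b:   # covers differing content and differing lengths
                    return False
                if not chunk_a:          # both files exhausted without a mismatch
                    return True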

I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case. In my informal tests, Adler-32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, then retest any files which match. This decision also depends on the relation between IO bandwidth and CPU power: if you have a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests; if you have faster IO, the rereads may be cheaper than doing expensive hashes unnecessarily.
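A sketch of that layered approach, using zlib's Adler-32 as the cheap first pass and a stronger hash only for the files whose weak checksum collides (the retest could equally be a byte-for-byte comparison):

    import hashlib
    import zlib
    from collections import defaultdict

    def adler32_of(path, chunk_size=1 << 20):
        """Cheap, non-cryptographic checksum used only as a first-pass filter."""
        checksum = 1  # Adler-32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                checksum = zlib.adler32(chunk, checksum)
        return checksum

    def sha256_of(path, chunk_size=1 << 20):
        """Stronger hash, computed only for files whose Adler-32 collides."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def confirmed_duplicates(paths):
        weak = defaultdict(list)
        for path in paths:
            weak[adler32_of(path)].append(path)

        duplicates = []
        for candidates in weak.values():
            if len(candidates) < 2:
                continue
            strong = defaultdict(list)
            for path in candidates:
                strong[sha256_of(path)].append(path)
            duplicates.extend(g for g in strong.values() if len(g) > 1)
        return duplicates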

Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on the file sizes, the computer's speed, and the files' entropy.

answered by Zack Bloom