What is the fastest way to check if files are identical?

If you have 1,000,000 source files, you suspect they are all the same, and you want to compare them, what is the current fastest method to compare those files? Assume they are Java files and that the platform where the comparison is done is not important. cksum is making me cry. When I say identical, I mean ALL identical.

Update: I know about generating checksums. diff is laughable ... I want speed.

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

Update: What about reading the number of blocks rather than bytes, and immediately throwing out any files where it differs? Is that faster than finding the number of bytes?

Update: Is this ANY different than the fastest way to compare two files?

301

asked Apr 24 '09 05:04

ojblass

1 Answer

I'd opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare them byte-by-byte. If they match, read the next block from each, compare them byte-by-byte, etc. If you get to the end of both files without detecting any differences, seek to the beginning of file 1, close file 2 and open file 3 in its place, and repeat until you've checked all files. I don't think there's any way to avoid reading all bytes of all files if they are in fact all identical, but I think this approach is (or is close to) the fastest way to detect any difference that may exist.
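A rough sketch of that loop in Python, in case it helps to see it spelled out. This is only an illustration of the cmp-style idea, not what cmp itself does internally; the function name `all_identical` and the 64 KiB block size are my own assumptions, and the best block size will depend on your storage.

```python
BLOCK_SIZE = 64 * 1024  # assumed block size, not a measured optimum


def all_identical(paths, block_size=BLOCK_SIZE):
    """Return True if every file in paths has the same bytes as the first one.

    Hypothetical helper: compares block by block and stops at the
    first block that differs.
    """
    paths = list(paths)
    if len(paths) < 2:
        return True
    with open(paths[0], "rb") as ref:
        for other_path in paths[1:]:
            ref.seek(0)  # rewind file 1 before comparing against the next file
            with open(other_path, "rb") as other:
                while True:
                    a = ref.read(block_size)
                    b = other.read(block_size)
                    if a != b:
                        return False  # first difference found, no need to read more
                    if not a:
                        break  # both files hit end-of-file together
    return True
```

If all files really are identical, every byte still gets read once per file; the early return only pays off when a difference exists.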

OP Modification: Lifted up important comment from Mark Bessey

"another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once."
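Here is a minimal sketch of the keep-one-file-in-memory variant, assuming the files are small enough to read whole. The name `all_match_reference` is hypothetical, and the file-size check before reading is my own addition (it is not part of the quoted comment); it just skips the read when the lengths already differ.

```python
import os


def all_match_reference(paths):
    """Return True if every file in paths equals the first one byte for byte.

    Hypothetical helper: the first file is read once and held in memory,
    so only one file is being read from disk at any time.
    """
    paths = list(paths)
    if not paths:
        return True
    with open(paths[0], "rb") as f:
        reference = f.read()  # assumes the file fits comfortably in memory
    for path in paths[1:]:
        # Cheap pre-check (an assumption, not from the quoted comment):
        # files of different length cannot be identical.
        if os.path.getsize(path) != len(reference):
            return False
        with open(path, "rb") as f:
            if f.read() != reference:
                return False
    return True
```

For large files you would probably want to read the candidates in blocks and compare against slices of the in-memory reference instead of reading each one whole.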

159

answered Sep 28 '22 08:09

David Z

Tags: language-agnostic, file, comparison
