
"Fastest" hash function implemented in Java, comparing part of file

I need to compare two different files of the instance "File" in Java and want to do this with a fast hash function.

Idea: - Hashing the 20 first lines in File 1 - Hashing the 20 first lines in File 2 - Compare the two hashes and return true if those are equal.

I want to use the "fastest" hash function ever been implemented in Java. Which one would you choose?

asked Apr 12 '11 by carloscloud


2 Answers

If you want speed, do not hash! Especially not a cryptographic hash like MD5. These hashes are designed to be hard to reverse, not fast to calculate. What you should use is a checksum - see java.util.zip.Checksum and its two concrete implementations, Adler32 and CRC32. Adler32 in particular is extremely fast to compute.
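As a sketch of that idea (the class and helper names here are just for illustration), java.util.zip.Adler32 can be fed each file's bytes and the resulting values compared:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

public class QuickChecksum {
    // Compute an Adler-32 checksum over all of a file's bytes.
    static long checksumOf(Path file) throws Exception {
        Checksum adler = new Adler32();
        byte[] data = Files.readAllBytes(file);
        adler.update(data, 0, data.length);
        return adler.getValue();
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("a", ".txt");
        Path b = Files.createTempFile("b", ".txt");
        Files.writeString(a, "same content\n");
        Files.writeString(b, "same content\n");
        System.out.println(checksumOf(a) == checksumOf(b)); // true
        Files.writeString(b, "different\n");
        System.out.println(checksumOf(a) == checksumOf(b)); // false
    }
}
```

Note that equal checksums only mean the files are *probably* equal; unequal checksums prove they differ.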

Any method based on checksums or hashes is vulnerable to collisions, but you can minimise the risk by using two different methods in the way rsync does.

The algorithm is basically:

  • Check file sizes are equal
  • Break the files into chunks of size N bytes
  • Compute a checksum for each pair of corresponding chunks and compare; any difference proves the files are not the same.

This allows for early detection of a difference. You can improve it by computing two checksums at once with different algorithms, or different block sizes.
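The chunked scheme above could look something like this - a sketch using an Adler-32 checksum per chunk with an early exit, where the 8 KB block size is an arbitrary choice:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

public class ChunkedCompare {
    private static final int CHUNK = 8192; // illustrative block size

    // Returns true if the files have equal size and every pair of
    // corresponding chunks has the same Adler-32 checksum.
    // Bails out on the first mismatching chunk (early detection).
    static boolean probablyEqual(Path a, Path b) throws Exception {
        if (Files.size(a) != Files.size(b)) return false; // cheap first check
        try (InputStream ia = new BufferedInputStream(Files.newInputStream(a));
             InputStream ib = new BufferedInputStream(Files.newInputStream(b))) {
            byte[] bufA = new byte[CHUNK];
            byte[] bufB = new byte[CHUNK];
            int nA;
            while ((nA = ia.readNBytes(bufA, 0, CHUNK)) > 0) {
                int nB = ib.readNBytes(bufB, 0, CHUNK);
                if (nA != nB) return false; // defensive: sizes already matched
                Adler32 ca = new Adler32();
                Adler32 cb = new Adler32();
                ca.update(bufA, 0, nA);
                cb.update(bufB, 0, nB);
                if (ca.getValue() != cb.getValue()) return false; // early exit
            }
            return true;
        }
    }
}
```

To use two methods as rsync does, the loop could additionally run a CRC32 over each chunk and require both checksums to match.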

More bits in the result mean less chance of a collision, but once you go past 64 bits you are outside what Java (and the CPU) can handle natively and things get slow: FNV-1024 is less likely to give you a false match, but is much slower.

If it is all about speed, just use Adler32 and accept that, very rarely, a difference will go undetected. It really is rare. Checksums like these are what network protocols use to spot transmission errors, and how often does corrupted data actually reach you?

If it is all about accuracy, you will have to compare every byte. Nothing else will work.

If you can compromise between speed and accuracy, there is a wealth of options out there.

answered Sep 28 '22 by Simon G.


If you're comparing two files at the same time on the same system, there's no need to hash either of them. Just compare the bytes of both files as you read them. If you're comparing them at different times, or they're in different places, then MD5 would be both fast and adequate. There's not much reason to need a faster hash unless you're dealing with really large files; even a laptop can hash hundreds of megabytes per second.
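For the same-time, same-system case, recent Java makes the byte-for-byte comparison a one-liner: Files.mismatch (Java 12+) returns -1 only when the two files are byte-identical, otherwise the position of the first differing byte. A minimal sketch:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ExactCompare {
    // Files.mismatch returns -1L when the files are byte-identical.
    static boolean identical(Path a, Path b) throws Exception {
        return Files.mismatch(a, b) == -1L;
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("cmp", ".txt");
        Path b = Files.createTempFile("cmp", ".txt");
        Files.writeString(a, "identical?");
        Files.writeString(b, "identical?");
        System.out.println(identical(a, b)); // true
    }
}
```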

You also need to hash the whole file if you want to verify the files are identical. Otherwise you might as well just check the size and last-modified time if you want a really quick check. You could also check the beginning and end of the file if they're very large and you trust that the middle won't be changing. If you're not dealing with hundreds of megabytes, though, you may as well check every byte of each file.
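The quick size-plus-timestamp check mentioned above might be sketched like this (it is only a cheap filter, not proof of equality - the method name is made up for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class QuickCheck {
    // Cheap pre-check: same size and same last-modified time.
    // Two different files can still pass this; use it only to
    // decide whether a full hash or byte compare is worth doing.
    static boolean quickMatch(Path a, Path b) throws Exception {
        return Files.size(a) == Files.size(b)
            && Files.getLastModifiedTime(a).equals(Files.getLastModifiedTime(b));
    }
}
```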

answered Sep 28 '22 by WhiteFang34