
Which hash to use for file uniqueness in Java

Tags: java, hash

I'm trying to keep track of a set of files, which may have the same name and metadata. I'd like to use a hash to differentiate them and use it as a unique ID, but I'm not sure which one to use. The files are relatively small (in the 100 KB range) and I'd like to be able to hash each one in less than 10 seconds. Which hash (that comes built into Java 1.5) would best suit my needs?

C. Ross asked Nov 23 '09 21:11


2 Answers

Note that a hash of this sort can never be guaranteed unique, though with an effective one you stand a very good chance of never having a collision.

If you are not concerned with security (i.e. someone deliberately trying to break your hashing) then simply using the MD5 hash will give you an excellent hash with minimal effort.

It is likely that you could compute a SHA hash of 100 KB in well under 10 seconds, though, and although SHA-1 is theoretically flawed, it is still stronger than MD5.

MessageDigest will get you an implementation of either.

Here are some examples of using it with streams.
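As one possible illustration of the stream approach, here is a minimal sketch using java.security.DigestInputStream, which updates a MessageDigest as bytes are read through it. The class name StreamDigestSketch, the helper digestOf and the 8192-byte buffer are my own choices for illustration, not something taken from the linked examples.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class StreamDigestSketch {
    // Hypothetical helper: reads the stream to its end, closes it,
    // and returns the digest of everything that was read.
    static byte[] digestOf(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        DigestInputStream dis = new DigestInputStream(in, md);
        try {
            byte[] buffer = new byte[8192]; // assumed size, tune as needed
            while (dis.read(buffer) != -1) {
                // Nothing to do here: reading through the wrapper updates md.
            }
        } finally {
            dis.close();
        }
        return md.digest();
    }
}
```

For a file you would pass in something like new FileInputStream(path); any other InputStream works the same way.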

Also, I should note that this excellent answer from jarnbjo indicates that even the SHA implementations supplied with Java are well capable of exceeding 20 MB/s, even on relatively modest x86 hardware. This implies 5-10 millisecond level performance on 100 KB of (in-memory) input data, so your target of under 10 seconds massively overestimates the effort involved. You will most likely be limited entirely by the rate at which you can read the files from disk rather than by any hashing algorithm you use.
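If you want to check the throughput claim on your own machine, a rough sketch along these lines should do. It is not a rigorous benchmark (a single warm-up call, one timed run), and the class name HashTimingSketch, the helper nanosToHash and the random test data are assumptions of mine.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

final class HashTimingSketch {
    // Hypothetical helper: time a single digest of an in-memory buffer.
    static long nanosToHash(byte[] data, String algorithm)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        long start = System.nanoTime();
        md.digest(data);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] data = new byte[100 * 1024]; // about the file size in the question
        new Random(42).nextBytes(data);
        String[] algorithms = {"MD5", "SHA-1", "SHA-256"};
        for (String algorithm : algorithms) {
            nanosToHash(data, algorithm); // warm-up run, result discarded
            long nanos = nanosToHash(data, algorithm);
            System.out.println(algorithm + ": " + (nanos / 1000) + " microseconds");
        }
    }
}
```

The numbers will vary with hardware and JVM, so treat them as an order-of-magnitude check only.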

If you have a need for strong crypto hashing you should indicate this in the question. Even then, SHA of some flavour above 1 is still likely to be your best bet unless you wish to use an external library like Bouncy Castle, since you should never try to roll your own crypto if a well-established implementation exists.
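To make "SHA of some flavour above 1" concrete, here is a small sketch that asks MessageDigest for SHA-256 and renders the result as a hex string, which is a convenient form for a file ID. The class name HexIdSketch and the method sha256Hex are hypothetical, and this assumes the whole input already sits in a byte array.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HexIdSketch {
    // Hypothetical helper: SHA-256 of the given bytes as 64 lowercase hex characters.
    static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            // Mask to 0..255 so negative bytes do not sign-extend.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Swapping the algorithm name for "SHA-1" or "MD5" gives 40 or 32 hex characters respectively.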

For some reasonably efficient sample code I suggest this how-to, the salient points of which can be distilled into the following (tune the buffer size as you see fit):

import java.io.*;
import java.security.MessageDigest;

public class Checksum
{
    // Algorithm name handed to MessageDigest: "SHA-1", or "MD5" etc.
    private static final String ALGORITHM = "SHA-1";

    public static byte[] createChecksum(String filename) throws
        Exception
    {
        InputStream fis = new FileInputStream(filename);
        try
        {
            byte[] buffer = new byte[1024];
            MessageDigest complete = MessageDigest.getInstance(ALGORITHM);
            int numRead;
            do
            {
                // read() returns the number of bytes read, or -1 at end of file
                numRead = fis.read(buffer);
                if (numRead > 0)
                {
                    complete.update(buffer, 0, numRead);
                }
            } while (numRead != -1);
            return complete.digest();
        }
        finally
        {
            // Always release the file handle, even if reading fails
            fis.close();
        }
    }
}
ShuggyCoUk answered Sep 28 '22 04:09


You could use MessageDigest with SHA-1:

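    MessageDigest messageDigest = MessageDigest.getInstance("SHA-1");
    // Buffer the stream: reading a raw FileInputStream one byte at a time is slow
    InputStream inputStream = new BufferedInputStream(new FileInputStream(aFile));
    try {
        int res;
        while ((res = inputStream.read()) != -1) {
            messageDigest.update((byte) res);
        }
    } finally {
        inputStream.close();
    }

    byte[] digest = messageDigest.digest();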
dfa answered Sep 28 '22 04:09