How to calculate the MD5 checksum of a directory with Java or Groovy?

I want to use Java or Groovy to compute the MD5 checksum of an entire directory.

I have to copy directories from a source to a target, checksum both the source and the target, and then delete the source directories.

I found this script for files, but how can I do the same thing for directories?

import java.security.MessageDigest

// Returns the MD5 checksum of a single file as a 32-character hex string
def generateMD5(final file) {
    MessageDigest digest = MessageDigest.getInstance("MD5")
    file.withInputStream { is ->
        byte[] buffer = new byte[8192]
        int read = 0
        // read() returns -1 at end of stream
        while ((read = is.read(buffer)) != -1) {
            digest.update(buffer, 0, read)
        }
    }
    byte[] md5sum = digest.digest()
    BigInteger bigInt = new BigInteger(1, md5sum)

    // Left-pad so leading zeros of the digest are not lost
    return bigInt.toString(16).padLeft(32, '0')
}

Is there a better approach?

Fabien Barbier asked Jun 09 '10 21:06



1 Answer

I had the same requirement and chose my 'directory hash' to be an MD5 hash of the concatenated streams of all (non-directory) files within the directory. As crozin mentioned in comments on a similar question, you can use SequenceInputStream to act as a stream concatenating a load of other streams. I'm using Apache Commons Codec for the MD5 algorithm.

Basically, you recurse through the directory tree, adding FileInputStream instances to a Vector for non-directory files. Vector then conveniently has the elements() method to provide the Enumeration that SequenceInputStream needs to loop through. To the MD5 algorithm, this just appears as one InputStream.

A gotcha is that you need the files presented in the same order every time for the hash to be the same with the same inputs. The listFiles() method in File doesn't guarantee an ordering, so I sort by filename.
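To see why the ordering matters, here is a small self-contained sketch. It is not part of the code further down: the class name ConcatOrderDemo and the sample strings are made up for illustration, and it uses the JDK's MessageDigest rather than Commons Codec. It hashes in-memory streams through SequenceInputStream. Splitting the same bytes across several streams should not change the digest, but swapping the order of the parts should.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Vector;

// Hypothetical demo class, for illustration only
class ConcatOrderDemo {

    // Hashes the concatenation of the given parts, in the order supplied
    static String md5OfConcatenation(String... parts)
            throws IOException, NoSuchAlgorithmException {
        Vector<InputStream> streams = new Vector<>();
        for (String part : parts) {
            streams.add(new ByteArrayInputStream(
                    part.getBytes(StandardCharsets.UTF_8)));
        }

        MessageDigest digest = MessageDigest.getInstance("MD5");
        // To the digest, the sequence looks like one continuous stream
        try (InputStream in = new SequenceInputStream(streams.elements())) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }

        // Format each digest byte as two lowercase hex characters
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same bytes, split differently: the two digests should match
        System.out.println(md5OfConcatenation("alpha", "beta"));
        System.out.println(md5OfConcatenation("alphabeta"));
        // Same parts, different order: the digest should differ
        System.out.println(md5OfConcatenation("beta", "alpha"));
    }
}
```

A side effect worth knowing: because only the concatenated bytes are hashed, file boundaries and file names do not influence the result.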

I was doing this for SVN-controlled files and wanted to avoid hashing the hidden SVN files, so I added a flag that skips hidden files (here, any file or directory whose name starts with a dot).

The relevant basic code is as below. (Obviously it could be 'hardened'.)

import org.apache.commons.codec.digest.DigestUtils;

import java.io.*;
import java.util.*;

// Wrapper class added so the methods compile; the name is arbitrary
public class DirHasher {

    public String calcMD5HashForDir(File dirToHash, boolean includeHiddenFiles) {

        if (!dirToHash.isDirectory()) {
            throw new IllegalArgumentException(
                    dirToHash.getAbsolutePath() + " is not a directory");
        }
        Vector<FileInputStream> fileStreams = new Vector<FileInputStream>();

        System.out.println("Found files for hashing:");
        collectInputStreams(dirToHash, fileStreams, includeHiddenFiles);

        // try-with-resources closes the sequence, including any streams
        // it has not consumed yet, even if reading fails
        try (SequenceInputStream seqStream =
                     new SequenceInputStream(fileStreams.elements())) {
            return DigestUtils.md5Hex(seqStream);
        }
        catch (IOException e) {
            throw new RuntimeException("Error reading files to hash in "
                                       + dirToHash.getAbsolutePath(), e);
        }

    }

    private void collectInputStreams(File dir,
                                     List<FileInputStream> foundStreams,
                                     boolean includeHiddenFiles) {

        File[] fileList = dir.listFiles();
        if (fileList == null) {             // I/O error or not a directory
            throw new IllegalStateException("Cannot list "
                                            + dir.getAbsolutePath());
        }
        Arrays.sort(fileList,               // Need in reproducible order
                    new Comparator<File>() {
                        public int compare(File f1, File f2) {
                            return f1.getName().compareTo(f2.getName());
                        }
                    });

        for (File f : fileList) {
            if (!includeHiddenFiles && f.getName().startsWith(".")) {
                // Skip dot-files and dot-directories (e.g. .svn)
            }
            else if (f.isDirectory()) {
                collectInputStreams(f, foundStreams, includeHiddenFiles);
            }
            else {
                try {
                    System.out.println("\t" + f.getAbsolutePath());
                    foundStreams.add(new FileInputStream(f));
                }
                catch (FileNotFoundException e) {
                    throw new AssertionError(e.getMessage()
                                + ": file should never not be found!");
                }
            }
        }

    }
}
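One possible way to harden this, offered as a sketch under assumptions rather than as part of the code above: open one file at a time instead of keeping every FileInputStream open, and mix each file's relative path into the digest so that renames and moves also change the result. The sketch assumes Java 8+ (java.nio.file) and the JDK's MessageDigest instead of Commons Codec. The names DirectoryMd5 and md5OfDirectory are hypothetical, and hidden files are not filtered. Because the paths are hashed too, it will not produce the same value as calcMD5HashForDir.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper class, for illustration only
class DirectoryMd5 {

    static String md5OfDirectory(Path root)
            throws IOException, NoSuchAlgorithmException {
        // Key: relative path with '/' separators; TreeMap keeps keys sorted,
        // which gives a reproducible order regardless of listing order
        SortedMap<String, Path> files = new TreeMap<>();
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(Files::isRegularFile).forEach(p ->
                    files.put(root.relativize(p).toString()
                                  .replace(File.separatorChar, '/'), p));
        }

        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        for (Map.Entry<String, Path> entry : files.entrySet()) {
            // Hash the relative path, then a zero byte as a separator
            digest.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
            // Only one file is open at any time
            try (InputStream in = Files.newInputStream(entry.getValue())) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
            }
        }
        // Zero-padded to 32 hex characters
        return String.format("%032x", new BigInteger(1, digest.digest()));
    }
}
```

For the copy, verify, delete workflow in the question, the idea would be to compare md5OfDirectory(source) with md5OfDirectory(target) and delete the source only if the two strings are equal.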
Stuart Rossiter answered Oct 10 '22 04:10
