FileChannel ByteBuffer and Hashing Files

Tags: java, file, hash

I built a file hashing method in Java that takes a string representation of a file path + filename and calculates the hash of that file. The hash can be any of the natively supported Java hashing algorithms, from MD2 through SHA-512.

I am trying to eke out every last drop of performance, since this method is an integral part of a project I'm working on. I was advised to try using FileChannel instead of a regular FileInputStream.

My original method:

    /**
     * Gets Hash of file.
     * 
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HashTypeException If no supported or valid hash algorithm was found.
     */
    public String getHash(String file, String hashAlgo) throws IOException, HashTypeException {
        StringBuffer hexString = null;
        try {
            MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
            FileInputStream fis = new FileInputStream(file);

            byte[] dataBytes = new byte[1024];

            int nread = 0;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
            fis.close();
            byte[] mdbytes = md.digest();

            hexString = new StringBuffer();
            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException | HashTypeException e) {
            throw new HashTypeException("Unsupported Hash Algorithm.", e);
        }
    }

Refactored method:

    /**
     * Gets Hash of file.
     * 
     * @param fileStr String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HasherException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {

        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocate(1024); // allocation in bytes

            int bytes;

            while ((bytes = fc.read(bbf)) != -1) {
                md.update(bbf.array(), 0, bytes);
            }

            fc.close();
            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

Both return a correct hash; however, the refactored method only seems to cooperate on small files. When I pass in a large file, it completely chokes and I can't figure out why. I'm new to NIO, so please advise.

EDIT: Forgot to mention I'm throwing SHA-512's through it for testing.

UPDATE: Updating with my now current method.

    /**
     * Gets Hash of file.
     * 
     * @param fileStr String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HasherException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {

        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocateDirect(8192); // allocation in bytes - 1024, 2048, 4096, 8192

            int b;

            b = fc.read(bbf);

            while ((b != -1) && (b != 0)) {
                bbf.flip();

                byte[] bytes = new byte[b];
                bbf.get(bytes);

                md.update(bytes, 0, b);

                bbf.clear();
                b = fc.read(bbf);
            }

            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

So I attempted a benchmark, computing the MD5 of a 2.92GB file using my original example and my latest update's example. Of course, any benchmark is relative, since OS and disk caching and other "magic" will skew repeated reads of the same file... but here's a shot at some numbers. I loaded each method up and ran it 5 times after compiling it fresh. The benchmark was taken from the last (5th) run, as this would be the "hottest" run for that algorithm and for any "magic" (in my theory, anyway).

Here are the benchmarks so far:

    Original Method - 14.987909 (s) 
    Latest Method - 11.236802 (s)

That is a 25.03% decrease in time taken to hash the same 2.92GB file. Pretty good.

SnakeDoc asked Apr 17 '13 03:04


2 Answers

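Three suggestions:

1) Clear the buffer after each read. Otherwise, once the buffer is full, every later read returns 0 rather than -1 and the loop never ends, which is why only small files work:

    int bytes;
    while ((bytes = fc.read(bbf)) != -1) {
        md.update(bbf.array(), 0, bytes);
        bbf.clear();
    }

2) Do not close both fc and fis. It's redundant; closing fis is enough. The FileInputStream.close API says:

"If this stream has an associated channel then the channel is closed as well."

3) If you want a performance improvement with FileChannel, use a direct buffer. Note that a direct buffer has no accessible backing array, so bbf.array() then has to be replaced by copying the bytes out with bbf.get(...):

    ByteBuffer.allocateDirect(1024);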
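
The suggestions above can be combined without calling `array()` at all, because `MessageDigest.update(ByteBuffer)` accepts heap and direct buffers alike. The following is a minimal sketch under that assumption, not the asker's method; the class name `ChannelHashSketch` and the 8192-byte buffer size are illustrative choices.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, named for illustration only.
final class ChannelHashSketch {

    /** Hashes a file through a FileChannel and one reusable direct buffer. */
    static byte[] hash(Path path, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        ByteBuffer buffer = ByteBuffer.allocateDirect(8192); // assumed size

        // try-with-resources closes the channel even if a read fails
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();     // switch from filling to draining
                md.update(buffer); // consumes everything up to the limit
                buffer.clear();    // make the full capacity writable again
            }
        }
        return md.digest();
    }
}
```

Passing the buffer straight to the digest also avoids the intermediate `byte[]` copy, though whether that is measurably faster would need to be benchmarked on the target machine.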
Evgeniy Dorofeev answered Oct 03 '22 12:10


Another possible improvement might come if the code only allocated the temp buffer once.

e.g.

        int bufsize = 8192;
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufsize); 
        byte[] temp = new byte[bufsize];
        int b = channel.read(buffer);

        while (b > 0) {
            buffer.flip();

            buffer.get(temp, 0, b);
            md.update(temp, 0, b);
            buffer.clear();

            b = channel.read(buffer);
        }

Addendum

Note: There is a bug in the string building code. Integer.toHexString does not pad its output, so any byte value below 0x10 (not just zero) is printed as a single digit and the resulting string comes out too short. This can easily be fixed, e.g.

    hexString.append(String.format("%02x", mdbytes[i]));
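
A hex encoder that always emits exactly two digits per byte avoids this class of bug. Here is a minimal sketch; the class name `HexSketch` and the method name `toHex` are hypothetical and do not appear in the question.

```java
// Hypothetical helper class, named for illustration only.
final class HexSketch {

    /** Encodes each byte as exactly two lowercase hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }
}
```

For example, `HexSketch.toHex(new byte[] {0x00, 0x0A, (byte) 0xFF})` yields `"000aff"`, where the unpadded loop would have produced `"0aff"`.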

Also, as an experiment, I rewrote the code to use mapped byte buffers. It runs about 30% faster (6-7 ms vs. 9-11 ms, FWIW). I expect you could get more out of it if you wrote hashing code that operated directly on the byte buffer.

I attempted to account for JVM initialization and file system caching by hashing a different file with each algorithm before starting the timer. The first run through the code is about 25 times slower than a normal run. This appears to be due to JVM initialization, because all runs in the timing loop are roughly the same length. They do not appear to benefit from caching. I tested with the MD5 algorithm. Also, during the timing portion, only one algorithm is run for the duration of the test program.
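
The warm-up-then-time approach described above can be sketched as a small harness. This is an assumption about how such a measurement might be structured, not the code that produced the numbers quoted here; `TimingSketch`, `Task` and `bestSeconds` are hypothetical names, and reporting the fastest run is one choice among several.

```java
// Hypothetical timing harness, named for illustration only.
final class TimingSketch {

    /** A unit of work that may throw, e.g. hashing one file. */
    interface Task {
        void run() throws Exception;
    }

    /**
     * Runs the task a few times untimed so class loading and JIT
     * compilation are out of the way, then returns the fastest
     * timed run in seconds.
     */
    static double bestSeconds(Task task, int warmups, int runs) throws Exception {
        for (int i = 0; i < warmups; i++) {
            task.run(); // untimed warm-up
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1e9;
    }
}
```

A harness like this does not control for the operating system's file cache, so timings of repeated reads of the same file still need to be interpreted with care.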

The code in the loop is shorter, so potentially more understandable. I'm not 100% certain what kind of pressure memory mapping many files under high volume would exert on the JVM, so that is something you would need to research before running this sort of solution under load.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileChannel.MapMode;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public static byte[] hash(File file, String hashAlgo) throws IOException {

        FileInputStream inputStream = null;

        try {
            MessageDigest md = MessageDigest.getInstance(hashAlgo);
            inputStream = new FileInputStream(file);
            FileChannel channel = inputStream.getChannel();

            long length = file.length();
            if (length > Integer.MAX_VALUE) {
                // you could make this work with some care,
                // but this code does not bother.
                throw new IOException("File " + file.getAbsolutePath() + " is too large.");
            }

            // map the whole file into memory, read-only
            ByteBuffer buffer = channel.map(MapMode.READ_ONLY, 0, length);

            int bufsize = 1024 * 8;
            byte[] temp = new byte[bufsize];
            int bytesRead = 0;

            // copy the mapped bytes out in chunks and feed them to the digest
            while (bytesRead < length) {
                int numBytes = Math.min(bufsize, (int) length - bytesRead);
                buffer.get(temp, 0, numBytes);
                md.update(temp, 0, numBytes);
                bytesRead += numBytes;
            }

            return md.digest();

        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported Hash Algorithm.", e);
        } finally {
            // closing the stream also closes its channel
            if (inputStream != null) {
                inputStream.close();
            }
        }
    }
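
One way to have the hashing operate directly on the mapped buffer, and to lift the `Integer.MAX_VALUE` limit, is to map the file in chunks and hand each mapped region to `MessageDigest.update(ByteBuffer)`. The following is a sketch under those assumptions rather than a measured improvement; `MappedHashSketch` and the `chunkSize` parameter are hypothetical names.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, named for illustration only.
final class MappedHashSketch {

    /**
     * Hashes a file by mapping it in regions of at most chunkSize bytes,
     * so files larger than Integer.MAX_VALUE bytes can be processed.
     */
    static byte[] hash(Path path, String algorithm, int chunkSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);

        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            long position = 0;
            while (position < size) {
                long chunk = Math.min(size - position, chunkSize);
                MappedByteBuffer region =
                        channel.map(FileChannel.MapMode.READ_ONLY, position, chunk);
                md.update(region); // digest reads the mapped bytes directly
                position += chunk;
            }
        }
        return md.digest();
    }
}
```

A call such as `MappedHashSketch.hash(path, "MD5", 1 << 30)` would map 1 GiB at a time. Mapped regions are only released when their buffers are garbage collected, so the concern about memory pressure under load applies here as well.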
Bill answered Oct 03 '22 14:10