
Confirming file content against hash

I need to check the integrity of file content. The files will be written to CD/DVD, and the discs might be copied many times. The idea is to identify which copies (once they have left Nero etc.) were copied correctly.

I am rather new to this, but a quick search suggests that Arrays.hashCode(byte[]) will fit the need. We can include a file on the disc that contains the result of that call for each resource of interest, then compare each stored value to the hash of the byte[] of the File as read from the disc when checked.

Do I understand the method correctly? Is this a valid way to go about checking file content?

If not, suggestions as to search keywords or strategies/methods/classes would be appreciated.


Working code based on the answer of Brendan. It takes care of the problem identified by VoidStar (needing to hold the entire byte[] in memory for getting the hash).

import java.io.File;
import java.io.FileInputStream;
import java.util.zip.CRC32;

class TestHash {

    public static void main(String[] args) throws Exception {
        File f = new File("TestHash.java");
        CRC32 crcMaker = new CRC32();
        byte[] buffer = new byte[65536];
        int bytesRead;
        // try-with-resources closes the stream even if a read fails
        try (FileInputStream fis = new FileInputStream(f)) {
            while ((bytesRead = fis.read(buffer)) != -1) {
                crcMaker.update(buffer, 0, bytesRead);
            }
        }
        long crc = crcMaker.getValue(); // This is your error checking code
        System.out.println("CRC code is " + crc);
    }
}
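
The scheme described in the question (a file on the disc listing one check value per resource, compared again when the disc is checked) could be sketched roughly as follows. This is only an illustration: the class ManifestCheck and its methods crcOf, buildManifest and findCorrupt are hypothetical names, and how the table is written to and read from the disc is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical helper; class and method names are illustrative only.
class ManifestCheck {

    // Streams the file through CRC32 so only one buffer is held in memory.
    static long crcOf(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[65536];
        try (InputStream in = Files.newInputStream(file)) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                crc.update(buffer, 0, bytesRead);
            }
        }
        return crc.getValue();
    }

    // Builds the name -> CRC table that would be stored alongside the files.
    static Map<String, Long> buildManifest(Path root, List<String> names) throws IOException {
        Map<String, Long> manifest = new LinkedHashMap<>();
        for (String name : names) {
            manifest.put(name, crcOf(root.resolve(name)));
        }
        return manifest;
    }

    // Returns the names whose current CRC differs from the recorded one,
    // or that cannot be read at all.
    static List<String> findCorrupt(Path root, Map<String, Long> manifest) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Long> entry : manifest.entrySet()) {
            try {
                if (crcOf(root.resolve(entry.getKey())) != entry.getValue()) {
                    bad.add(entry.getKey());
                }
            } catch (IOException unreadable) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }
}
```

An empty list from findCorrupt would mean every listed file still matches its recorded value; an unreadable file is treated as a failed copy in this sketch.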
Andrew Thompson asked Oct 15 '11 05:10



3 Answers

Arrays.hashCode() is designed to be very fast (it is meant for hash tables), not to detect corrupted data. I highly recommend not using it for this purpose.

What you want is some sort of error-checking code like a CRC.

Java happens to have a class for calculating these, CRC32:

InputStream in = ...;
CRC32 crcMaker = new CRC32();
byte[] buffer = new byte[someSize];
int bytesRead;
while((bytesRead = in.read(buffer)) != -1) {
    crcMaker.update(buffer, 0, bytesRead);
}
long crc = crcMaker.getValue(); // This is your error checking code
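
To illustrate why Arrays.hashCode() is a poor fit for error detection, here is a small sketch (the class name HashVsCrc and the two sample arrays are made up for the demonstration). Arrays.hashCode(byte[]) starts at 1 and computes 31 * h + b for each byte, so different short inputs can easily produce the same value, whereas CRC32 tells these two inputs apart.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical demo class; the sample data is chosen to force a collision.
class HashVsCrc {

    // Convenience wrapper: CRC32 of a whole in-memory array.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = {0, 31};
        byte[] damaged = {1, 0};

        // Both arrays hash to 992 under Arrays.hashCode, so the damage goes unnoticed.
        System.out.println(Arrays.hashCode(original) == Arrays.hashCode(damaged)); // prints "true"

        // The CRC32 values differ, so the damage is detected.
        System.out.println(crc32(original) == crc32(damaged)); // prints "false"
    }
}
```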
Brendan Long answered Sep 22 '22 16:09


You need to create a checksum file. Here is an example (see http://www.jguru.com/faq/view.jsp?EID=216274):

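    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.util.zip.CRC32;
    import java.util.zip.CheckedInputStream;

    class ChecksumExample {
        public static void main(String[] args) throws Exception {
            CheckedInputStream check =
              new CheckedInputStream(new FileInputStream(args[0]), new CRC32());
            // Closing the outer stream also closes the wrapped streams
            try (BufferedInputStream in = new BufferedInputStream(check)) {
                while (in.read() != -1) {
                    // Read file in completely; the checksum updates as bytes pass through
                }
            }
            System.out.println("Checksum is " +
              check.getChecksum().getValue());
        }
    }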
DarthVader answered Sep 23 '22 16:09


Yes, as long as you load the entire file and pass it in, it will perform as expected. However, it will consume as much RAM as the file is large, which is not necessary for this task. If you instead hash the file in smaller blocks as you stream it from storage, you can avoid wasting memory. You could, for example, xor together the hashes of each block to create a final hash, or find a hash implementation that expects data to be streamed.
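
As a rough sketch of the second option (a hash implementation that accepts streamed data), java.security.MessageDigest can be fed one buffer at a time. The class name StreamedDigest and the choice of SHA-256 are assumptions made for this example, not something the question requires. Note that xor-ing per-block hashes is insensitive to block order (two swapped blocks give the same result), which is one reason a streaming digest may be the safer of the two routes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; SHA-256 is an assumed choice of algorithm.
class StreamedDigest {

    // Feeds the stream to the digest block by block and returns the hash as lowercase hex.
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[65536];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            digest.update(buffer, 0, bytesRead);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Memory use stays at one buffer regardless of file size, and the resulting string can be stored and compared in the same way as a CRC value.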

VoidStar answered Sep 23 '22 16:09