I have a requirement to 'check the integrity' of the content of files. The files will be written to CD/DVD, which might be copied many times. The idea is to identify copies (after they are removed from Nero etc.) that were copied correctly.
I am rather new to this, but a quick search suggests that Arrays.hashCode(byte[]) will fit the need. We could include a file on the disc that contains the result of that call for each resource of interest, then compare it to the value computed from the byte[] of the File as read from disk when it is checked.
Do I understand the method correctly? Is this a valid way to go about checking file content?
If not, suggestions as to search keywords or strategies/methods/classes would be appreciated.
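For context, the approach described above would look roughly like this minimal sketch (class name is made up; it reads the whole file into memory just to compute the value, which is the limitation discussed further down):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class NaiveHash {
    public static void main(String[] args) throws Exception {
        // Read the entire file into memory, then hash the byte[].
        byte[] content = Files.readAllBytes(Paths.get("TestHash.java"));
        int hash = Arrays.hashCode(content); // 32-bit value, not designed for integrity checks
        System.out.println("Arrays.hashCode value is " + hash);
    }
}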
Working code based on Brendan's answer. It takes care of the problem identified by VoidStar (needing to hold the entire byte[] in memory to get the hash).
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.CRC32;

class TestHash {
    public static void main(String[] args) throws Exception {
        File f = new File("TestHash.java");
        CRC32 crcMaker = new CRC32();
        // Stream the file through the CRC in 64 KB chunks so the whole
        // file never has to be held in memory at once.
        try (FileInputStream fis = new FileInputStream(f)) {
            byte[] buffer = new byte[65536];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                crcMaker.update(buffer, 0, bytesRead);
            }
        }
        long crc = crcMaker.getValue(); // This is your error-checking code
        System.out.println("CRC code is " + crc);
    }
}
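To tie this back to the original requirement (storing the value on the disc and comparing it later), a rough sketch follows; the class name, the manifest file name checksums.txt and its one-line "crc filename" format are made up for illustration:
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;

class VerifyCopy {
    // Compute a file's CRC-32 by streaming it, as in the answer above.
    static long crcOf(File f) throws Exception {
        CRC32 crcMaker = new CRC32();
        try (FileInputStream fis = new FileInputStream(f)) {
            byte[] buffer = new byte[65536];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                crcMaker.update(buffer, 0, bytesRead);
            }
        }
        return crcMaker.getValue();
    }

    public static void main(String[] args) throws Exception {
        File original = new File("TestHash.java");

        // Before burning: record the CRC in a manifest that goes on the disc.
        try (PrintWriter out = new PrintWriter("checksums.txt")) {
            out.println(crcOf(original) + " " + original.getName());
        }

        // After copying: recompute the CRC of the copy and compare it with the manifest entry.
        String[] entry = Files.readAllLines(Paths.get("checksums.txt")).get(0).split(" ");
        long expected = Long.parseLong(entry[0]);
        File copy = new File(entry[1]); // in practice, the path on the copied disc
        System.out.println(entry[1] + (crcOf(copy) == expected ? " copied correctly" : " is corrupted"));
    }
}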
Hash Verification
The additional information is called a "hash" of the file, and it serves as an integrity check to verify that the file has not been altered or tampered with. For that purpose it is important that the function used is cryptographically secure, i.e. cannot simply be reversed.
A checksum is intended to verify (check) the integrity of data and to detect data-transmission errors, while a cryptographic hash is designed to act as a unique digital fingerprint of the data: it remains unchanged from the time it is created, so any later change to the file changes the value. A checksum protects against accidental changes; a cryptographic hash protects against a very motivated attacker deliberately manipulating the file.
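If a cryptographic hash is wanted instead of a simple checksum, Java's built-in java.security.MessageDigest can be used in the same streaming style; a minimal sketch (SHA-256 is chosen here only as a common example, and the class name is made up):
import java.io.FileInputStream;
import java.security.MessageDigest;

class TestSha256 {
    public static void main(String[] args) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (FileInputStream fis = new FileInputStream("TestHash.java")) {
            byte[] buffer = new byte[65536];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                digest.update(buffer, 0, bytesRead); // hash the file in chunks
            }
        }
        // Render the 32-byte digest as a hex string for storing and comparing.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        System.out.println("SHA-256 is " + hex);
    }
}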
Arrays.hashCode() is designed to be very fast (it is used in hash tables); I highly recommend not using it for this purpose.
What you want is some sort of error-checking code like a CRC. Java happens to have a class for calculating these, CRC32:
InputStream in = ...;
CRC32 crcMaker = new CRC32();
byte[] buffer = new byte[someSize];
int bytesRead;
while((bytesRead = in.read(buffer)) != -1) {
crcMaker.update(buffer, 0, bytesRead);
}
long crc = crcMaker.getValue(); // This is your error checking code
You need to create a checksum file. Here is an example (from http://www.jguru.com/faq/view.jsp?EID=216274):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

class FileChecksum {
    public static void main(String[] args) throws Exception {
        FileInputStream file = new FileInputStream(args[0]);
        // CheckedInputStream updates the CRC32 as a side effect of reading.
        CheckedInputStream check = new CheckedInputStream(file, new CRC32());
        BufferedInputStream in = new BufferedInputStream(check);
        while (in.read() != -1) {
            // Read file in completely
        }
        in.close();
        System.out.println("Checksum is " + check.getChecksum().getValue());
    }
}
Yes, as long as you load the entire file and pass it in, it will perform as expected. However, it will consume as much RAM as the file is big, which is not necessary for this task. If you instead hash the file in smaller blocks as you stream it from storage, you can avoid wasting memory. You could, for example, XOR together the hashes of each block to create a final hash, or use a hash implementation that expects data to be streamed, as sketched below.
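A minimal sketch of the streaming approach, using java.security.DigestInputStream so the digest is updated as the file is read and only one buffer's worth of data is ever in memory (the class name and the choice of SHA-256 are just for illustration):
import java.io.FileInputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

class StreamedHash {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[65536];
        // DigestInputStream feeds every byte read into the digest, so the
        // whole file never has to be loaded at once.
        try (DigestInputStream dis = new DigestInputStream(new FileInputStream(args[0]), md)) {
            while (dis.read(buffer) != -1) {
                // just reading drives the digest
            }
        }
        byte[] hash = md.digest();
        System.out.println("Hash has " + hash.length + " bytes");
    }
}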