
Different text but same CRC checksum?

Tags: java, crc32

My application uses CRC32 to check whether two pieces of content or two files are the same. But when I tried to use it to generate unique IDs, I ran into a problem: two different strings can have the same CRC32. Here is my Java code. Thanks in advance.

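import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Returns the CRC32 of the string's bytes as a decimal string.
public static String getCRC32(String content) {
    byte[] bytes = content.getBytes(); // uses the platform-default charset
    Checksum checksum = new CRC32();
    checksum.update(bytes, 0, bytes.length);
    return String.valueOf(checksum.getValue());
}

public static void main(String[] args) {
    // Two different strings that produce the same CRC32.
    System.out.println(getCRC32("b5a7b602ab754d7ab30fb42c4fb28d82"));
    System.out.println(getCRC32("d19f2e9e82d14b96be4fa12b8a27ee9f"));
}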
Viet asked Nov 30 '22 18:11



2 Answers

Yes, that's what CRCs are like. They're not unique IDs. They're likely to be different for different inputs, but they don't have to be. After all, you're providing more than 32 bits of input, and a CRC32 has only 2^32 possible values, so you can't expect more than 2^32 different inputs to all produce different CRCs.
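To make the counting argument concrete, here is a minimal sketch that searches for a collision by brute force. The class name `Crc32CollisionSearch`, the seed and the try limit are all hypothetical choices made for illustration. It assumes randomly generated 32-character hex strings, similar in shape to the IDs in the question, and remembers every CRC it has seen until one repeats.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical helper class, for illustration only.
class Crc32CollisionSearch {

    // CRC32 of the UTF-8 bytes of the given text.
    static long crc32(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Generates random 32-character hex strings until two different strings
    // share a CRC32. Returns the pair, or null if maxTries is exhausted.
    static String[] findCollision(long seed, int maxTries) {
        Random random = new Random(seed);
        Map<Long, String> seen = new HashMap<>();
        for (int i = 0; i < maxTries; i++) {
            String candidate = String.format("%016x%016x", random.nextLong(), random.nextLong());
            // put() returns the string previously stored under this CRC, if any.
            String previous = seen.put(crc32(candidate), candidate);
            if (previous != null && !previous.equals(candidate)) {
                return new String[] { previous, candidate };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] pair = findCollision(42L, 2_000_000);
        if (pair == null) {
            System.out.println("No collision found within the try limit");
        } else {
            System.out.println(pair[0] + " and " + pair[1] + " share CRC32 " + crc32(pair[0]));
        }
    }
}
```

With only 2^32 possible values, a search like this tends to succeed after a number of strings on the order of 2^16 rather than 2^32, which is why CRC32 is a poor fit for unique IDs even at modest scale.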

A longer cryptographic hash (e.g. SHA-256) is far more likely to give different outputs for different inputs, but it's still not impossible (and can't be, due to the amount of input data vs output data). The big difference between a CRC and a cryptographic hash is that a CRC is relatively easy to "steer" if you want to - it's not terribly hard to find collisions, and it's used to protect against accidental data corruption. Cryptographic hashes are designed to protect against deliberate data corruption by some attacker - so it's hard to deliberately create a value targeting a specific hash.
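If a content-derived identifier is what you need, a SHA-256 digest might look like the sketch below. The class and method names (`Sha256Id`, `sha256Hex`) are hypothetical. It uses only the standard `MessageDigest` API and assumes UTF-8 as the encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, for illustration only.
class Sha256Id {

    // Returns the SHA-256 digest of the UTF-8 bytes of content as 64 lowercase hex characters.
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("b5a7b602ab754d7ab30fb42c4fb28d82"));
        System.out.println(sha256Hex("d19f2e9e82d14b96be4fa12b8a27ee9f"));
    }
}
```

The output is 256 bits instead of 32, so accidental collisions are far less likely, though still not impossible in principle.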

As an aside, your use of String.getBytes() without specifying a charset is problematic - it uses the platform-default encoding, so if you run the same code on two machines with the same input, you can get different results. I would strongly encourage you to use a fixed encoding (e.g. UTF-8).
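As a sketch of that suggestion, the method from the question could pass an explicit charset to `getBytes`. The wrapper class name `Crc32Utf8` and the sample word in `main` are hypothetical; everything else follows the question's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical wrapper class, for illustration only.
class Crc32Utf8 {

    // Same as the question's method, but with the encoding chosen by the caller.
    static String getCRC32(String content, Charset charset) {
        byte[] bytes = content.getBytes(charset);
        Checksum checksum = new CRC32();
        checksum.update(bytes, 0, bytes.length);
        return String.valueOf(checksum.getValue());
    }

    // Fixed-encoding variant: always UTF-8, independent of the platform default.
    static String getCRC32(String content) {
        return getCRC32(content, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A non-ASCII character is encoded as different bytes under different
        // charsets, so the checksum depends on which encoding is used.
        System.out.println(getCRC32("caf\u00e9", StandardCharsets.UTF_8));
        System.out.println(getCRC32("caf\u00e9", StandardCharsets.ISO_8859_1));
    }
}
```

For pure ASCII input such as the hex strings in the question, most common encodings produce the same bytes, so the problem only shows up once other characters appear.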

Jon Skeet answered Dec 07 '22 23:12



Yes, they can be the same, but that will occur accidentally with a very low probability of 2^-32.
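That figure applies to one pair of strings. For a whole set of IDs, a rough estimate can be sketched with the standard birthday approximation, assuming the CRCs of unrelated inputs behave like uniformly random 32-bit values (an assumption, not a guarantee). The class and method names here are hypothetical.

```java
// Hypothetical helper class, for illustration only.
class Crc32CollisionOdds {

    // Approximate probability that at least two of n uniformly random 32-bit
    // values coincide: 1 - exp(-n(n-1) / (2 * 2^32)).
    static double collisionProbability(long n) {
        double pairs = (double) n * (n - 1) / 2.0;
        // -expm1(-x) equals 1 - exp(-x) and stays accurate for tiny x.
        return -Math.expm1(-pairs / 4294967296.0);
    }

    public static void main(String[] args) {
        for (long n : new long[] { 2, 1_000, 10_000, 77_163, 1_000_000 }) {
            System.out.printf("%,d ids: collision probability about %.6g%n", n, collisionProbability(n));
        }
    }
}
```

Under this assumption, two IDs collide with probability about 2^-32, but roughly 77,000 IDs already give about even odds that some pair collides.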

As Jon noted, you can construct strings with the same CRC deliberately. My spoof code automates that. Here is an example of another string with the same CRC as those presented in the problem, but with limited differences from the first string: b5a7b702ab643f7ac47fb57c4fb28b82, generated using spoof.

Mark Adler answered Dec 08 '22 01:12
