I've got a huge table with something like 8 300 000 rows (won't be edit nor delete ever).
My first column look something similar P300-4312B_X16_S
and the entry isn't unique so I use a regular INDEX on this field.
However, MySQL is WAY faster using a binary field instead of a varchar so I encode my INDEX in MD5 using BINARY(16)
to store the data.
This morning, I've started to use CRC32 for the first time and I saw that CRC32 can be output as an hexadecimal string using 8 characters.
My question : If I use a CRC32 instead of a MD5, it will be faster. However, when CRC32 is ran over let's say 2 000 000 unique value, the result will be unique or maybe sometime I'll have twice the same string for two differents string ? I ask that because the result is only 8 characters (32b) long instead of 32(128b) like the MD5.
Thanks.
The expected number of collisions is the number of pairs over the number of possible check values. So for 2,000,000 values there are (2000000 * 1999999) / 2 pairs, which is about 2x1012. For a 32-bit CRC, the expected number of collisions is that over 232, which is 466. So you are essentially guaranteed to have collisions in that case.
For a 128-bit MD5 check value, the expected number of collisions is about 6x10-27. For small values of the expected number, that is also the probability of one collision.
If it is important to you to have a very low probability of a collision, then you need to pick something other than a CRC-32.
You don't need the overhead of MD5 though, where its cryptographic strength is unimportant for your application. You don't really care if someone malicious can find a way to fabricate an entry with the same check value as another entry. Therefore you could use a 64-bit non-cryptographic hash designed for that purpose, which would run much faster and would give a 10-7 probability of a collision in your case of 2,000,000 values. Or you could use a 128-bit non-cryptographic hash and get the same probability as for MD5, but much faster. Take a look at the CityHash family of hash algorithms.
Note however that in all cases the probability of a collision is not zero. You should consider the consequences of a collision to your code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With