Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up my indexes in MySQL - CRC or MD5?

Tags:

php

mysql

crc32

I've got a huge table with something like 8 300 000 rows (won't be edit nor delete ever).

My first column look something similar P300-4312B_X16_S and the entry isn't unique so I use a regular INDEX on this field.

However, MySQL is WAY faster using a binary field instead of a varchar so I encode my INDEX in MD5 using BINARY(16) to store the data.

This morning, I've started to use CRC32 for the first time and I saw that CRC32 can be output as an hexadecimal string using 8 characters.

My question : If I use a CRC32 instead of a MD5, it will be faster. However, when CRC32 is ran over let's say 2 000 000 unique value, the result will be unique or maybe sometime I'll have twice the same string for two differents string ? I ask that because the result is only 8 characters (32b) long instead of 32(128b) like the MD5.

Thanks.

like image 360
David Bélanger Avatar asked Oct 01 '12 18:10

David Bélanger


1 Answers

The expected number of collisions is the number of pairs over the number of possible check values. So for 2,000,000 values there are (2000000 * 1999999) / 2 pairs, which is about 2x1012. For a 32-bit CRC, the expected number of collisions is that over 232, which is 466. So you are essentially guaranteed to have collisions in that case.

For a 128-bit MD5 check value, the expected number of collisions is about 6x10-27. For small values of the expected number, that is also the probability of one collision.

If it is important to you to have a very low probability of a collision, then you need to pick something other than a CRC-32.

You don't need the overhead of MD5 though, where its cryptographic strength is unimportant for your application. You don't really care if someone malicious can find a way to fabricate an entry with the same check value as another entry. Therefore you could use a 64-bit non-cryptographic hash designed for that purpose, which would run much faster and would give a 10-7 probability of a collision in your case of 2,000,000 values. Or you could use a 128-bit non-cryptographic hash and get the same probability as for MD5, but much faster. Take a look at the CityHash family of hash algorithms.

Note however that in all cases the probability of a collision is not zero. You should consider the consequences of a collision to your code.

like image 173
Mark Adler Avatar answered Nov 10 '22 10:11

Mark Adler