Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create an aggregate checksum of a column

I want to compute a checksum of all of the values of a column in aggregate.

In other words, I want to do some equivalent of

md5(group_concat(some_column))

The problem with this approach is:

  1. It's inefficient. It has to concat all of the values of the column as a string in some temporary storage before passing it to the md5 function
  2. group_concat has a max length of 1024, after which everything else will be truncated.

(In case you're wondering, you can ensure that the concat of the values is in a consistent order, however, as believe it or not group_concat() accepts an order by clause within it, e.g. group_concat(some_column order by some_column))

MySQL offers the nonstandard bitwise aggregate functions BIT_AND(), BIT_OR() and BIT_XOR() which I presume would be useful for this problem. The column is numeric in this case but I would be interested to know if there was a way to do it with string columns.

For this particular application, the checksum does not have to be cryptologically safe.

like image 500
ʞɔıu Avatar asked Feb 26 '09 16:02

ʞɔıu


2 Answers

It seems like you might as well use crc32 instead of md5 if you don't care about cryptographic strength. I think this:

select sum(crc32(some_column)) from some_table;

would work on strings. It might be inefficient as perhaps MySQL would create a temporary table (especially if you added an order by).

like image 162
Jacob Gabrielson Avatar answered Oct 19 '22 22:10

Jacob Gabrielson


The following query is used in Percona's Mysql Table Checksumming tool. Its a little tough to understand, but essentially it CRC32s the column (or a bunch of columns concatted) for every row, then XORs them all together using the BIT_XOR group function. If one crc hash is different, the result of XORing everything will also be different. This happens in fixed memory, so you can checksum arbitrarily large tables.

SELECT CONV(BIT_XOR(CAST(CRC32(column) AS UNSIGNED)), 10, 16)

One thing to keep in mind though that this does not prevent possible collisions, and CRC32 is a pretty weak function by today's standards. A nicer hashing function would be something like the FNV_64. It would be very unlikely to have two hashes which complement each other when XORed together.

like image 31
Yarek T Avatar answered Oct 19 '22 22:10

Yarek T