Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the CRC32 function work when using sampling data?

Tags:

python

hex

hash

I would like to ask you about explanation of the following short function in Python..

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

The above-mentioned function should be the same as the following function:

import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

Both functions should be used for data sampling (select some rows in a table). For example, if test_ratio is 0.2 then it means that I want to sample 20% data, the value is lower or equal to 51 (~20% of 256). I understand how the second function works but I don't understand the first one. Could you please explain to me the first function? I don't understand the following part: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

like image 631
Andrew_256 Avatar asked Jun 01 '18 15:06

Andrew_256


1 Answers

The crc32 function outputs an unsigned 32-bit number, and the code tests if the CRC value is lower than the test_ratio times the maximum 32-bit number.

The & 0xffffffff mask is there only to ensure compatibility with Python 2 and 3. In Python 2 the same function could return a signed integer, in a range from -(2^31) to (2^31) - 1, masking this with the 0xffffffff mask normalises the value to a signed.

So basically, either version turns the identifier into an integer, and the hash is used to make that integer reasonably uniformly distributed in a range; for the MD5 hash that's the last byte making the value fall between 0 and 255, for the CRC32 checksum the value lies between 0 and (2^32)-1. This integer is then compared to the full range; if it falls below the test_ratio * maximum cut-off point it is considered selected.

You could also use a random function, but then you'd get a different subset of your input each time you picked a sample; by hashing the identifier you get to produce a consistent subset. The difference between the two methods is that they'll produce a different subset, so you could use both together to pick multiple, independent subsets from the same input.

Compare:

>>> import numpy as np
>>> from zlib import crc32
>>> from hashlib import md5
>>> import random
>>> identifier = np.int64(random.randrange(2**63))
>>> md5(identifier).digest()[-1]
243
>>> md5(identifier).digest()[-1] / 256  # as a ratio of the full range
0.94921875
>>> crc32(identifier)
4276259108
>>> crc32(identifier) / (2 ** 32)   # ratio again
0.9956441605463624
>>> identifier = np.int64(random.randrange(2**63))  # different id to compare
>>> md5(identifier).digest()[-1] / 256  # as a ratio of the full range
0.83203125
>>> crc32(identifier) / (2 ** 32)   # ratio again
0.10733163682743907

So the two different methods produce different outputs, but as long as the CRC32 and MD5 hashes produce reasonably uniformly distributed hash values, then either will give you a fair 20% sampling rate.

like image 190
Martijn Pieters Avatar answered Nov 15 '22 21:11

Martijn Pieters