How does the CRC32 function work when using sampling data?

Question

I would like to ask you about explanation of the following short function in Python..

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

The above-mentioned function should be the same as the following function:

import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

Both functions should be used for data sampling (select some rows in a table). For example, if test_ratio is 0.2 then it means that I want to sample 20% data, the value is lower or equal to 51 (~20% of 256). I understand how the second function works but I don't understand the first one. Could you please explain to me the first function? I don't understand the following part: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

Martijn Pieters · Accepted Answer

The crc32 function outputs an unsigned 32-bit number, and the code tests if the CRC value is lower than the test_ratio times the maximum 32-bit number.

The & 0xffffffff mask is there only to ensure compatibility with Python 2 and 3. In Python 2 the same function could return a signed integer, in a range from -(2^31) to (2^31) - 1, masking this with the 0xffffffff mask normalises the value to a signed.

So basically, either version turns the identifier into an integer, and the hash is used to make that integer reasonably uniformly distributed in a range; for the MD5 hash that's the last byte making the value fall between 0 and 255, for the CRC32 checksum the value lies between 0 and (2^32)-1. This integer is then compared to the full range; if it falls below the test_ratio * maximum cut-off point it is considered selected.

You could also use a random function, but then you'd get a different subset of your input each time you picked a sample; by hashing the identifier you get to produce a consistent subset. The difference between the two methods is that they'll produce a different subset, so you could use both together to pick multiple, independent subsets from the same input.

Compare:

>>> import numpy as np
>>> from zlib import crc32
>>> from hashlib import md5
>>> import random
>>> identifier = np.int64(random.randrange(2**63))
>>> md5(identifier).digest()[-1]
243
>>> md5(identifier).digest()[-1] / 256  # as a ratio of the full range
0.94921875
>>> crc32(identifier)
4276259108
>>> crc32(identifier) / (2 ** 32)   # ratio again
0.9956441605463624
>>> identifier = np.int64(random.randrange(2**63))  # different id to compare
>>> md5(identifier).digest()[-1] / 256  # as a ratio of the full range
0.83203125
>>> crc32(identifier) / (2 ** 32)   # ratio again
0.10733163682743907

So the two different methods produce different outputs, but as long as the CRC32 and MD5 hashes produce reasonably uniformly distributed hash values, then either will give you a fair 20% sampling rate.

How does the CRC32 function work when using sampling data?

Tags:

python

hex

hash

Andrew_256

1 Answers

Martijn Pieters

Recent Activity

Donate For Us

How does the CRC32 function work when using sampling data?

Tags:

python

hex

hash

Andrew_256

1 Answers

Martijn Pieters

Related questions

Recent Activity

Donate For Us