I would like to ask you about explanation of the following short function in Python..
from zlib import crc32
def test_set_check(identifier, test_ratio):
return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
The above-mentioned function should be the same as the following function:
import hashlib
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
Both functions should be used for data sampling (select some rows in a table). For example, if test_ratio
is 0.2 then it means that I want to sample 20% data, the value is lower or equal to 51 (~20% of 256). I understand how the second function works but I don't understand the first one. Could you please explain to me the first function? I don't understand the following part: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
The crc32
function outputs an unsigned 32-bit number, and the code tests if the CRC value is lower than the test_ratio times the maximum 32-bit number.
The & 0xffffffff
mask is there only to ensure compatibility with Python 2 and 3. In Python 2 the same function could return a signed integer, in a range from -(2^31) to (2^31) - 1, masking this with the 0xffffffff
mask normalises the value to a signed.
So basically, either version turns the identifier into an integer, and the hash is used to make that integer reasonably uniformly distributed in a range; for the MD5 hash that's the last byte making the value fall between 0 and 255, for the CRC32 checksum the value lies between 0 and (2^32)-1. This integer is then compared to the full range; if it falls below the test_ratio * maximum
cut-off point it is considered selected.
You could also use a random function, but then you'd get a different subset of your input each time you picked a sample; by hashing the identifier you get to produce a consistent subset. The difference between the two methods is that they'll produce a different subset, so you could use both together to pick multiple, independent subsets from the same input.
Compare:
>>> import numpy as np
>>> from zlib import crc32
>>> from hashlib import md5
>>> import random
>>> identifier = np.int64(random.randrange(2**63))
>>> md5(identifier).digest()[-1]
243
>>> md5(identifier).digest()[-1] / 256 # as a ratio of the full range
0.94921875
>>> crc32(identifier)
4276259108
>>> crc32(identifier) / (2 ** 32) # ratio again
0.9956441605463624
>>> identifier = np.int64(random.randrange(2**63)) # different id to compare
>>> md5(identifier).digest()[-1] / 256 # as a ratio of the full range
0.83203125
>>> crc32(identifier) / (2 ** 32) # ratio again
0.10733163682743907
So the two different methods produce different outputs, but as long as the CRC32 and MD5 hashes produce reasonably uniformly distributed hash values, then either will give you a fair 20% sampling rate.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With