Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert integer to a random but deterministically repeatable choice

How do I convert an unsigned integer (representing a user ID) to a random looking but actually a deterministically repeatable choice? The choice must be selected with equal probability (irrespective of the distribution of the the input integers). For example, if I have 3 choices, i.e. [0, 1, 2], the user ID 123 may always be randomly assigned choice 2, whereas the user ID 234 may always be assigned choice 1.

Cross-language and cross-platform algorithmic reproducibility is desirable. I'm inclined to use a hash function and modulo unless there is a better way. Here is what I have:

>>> num_choices = 3
>>> id_num = 123
>>> int(hashlib.sha256(str(id_num).encode()).hexdigest(), 16) % num_choices
2

I'm using the latest stable Python 3. Please note that this question is similar but not exactly identical to the related question to convert a string to random but deterministically repeatable uniform probability.

like image 749
Asclepius Avatar asked Dec 21 '16 05:12

Asclepius


1 Answers

Using hash and modulo

import hashlib

def id_to_choice(id_num, num_choices):
    id_bytes = id_num.to_bytes((id_num.bit_length() + 7) // 8, 'big')
    id_hash = hashlib.sha512(id_bytes)
    id_hash_int = int.from_bytes(id_hash.digest(), 'big')  # Uses explicit byteorder for system-agnostic reproducibility
    choice = id_hash_int % num_choices  # Use with small num_choices only
    return choice

>>> id_to_choice(123, 3)
0
>>> id_to_choice(456, 3)
1

Notes:

  • The built-in hash method must not be used because it can preserve the input's distribution, e.g. with hash(123). Alternatively, it can return values that differ when Python is restarted, e.g. with hash('123').

  • For converting an int to bytes, bytes(id_num) works but is grossly inefficient as it returns an array of null bytes, and so it must not be used. Using int.to_bytes is better. Using str(id_num).encode() works but wastes a few bytes.

  • Admittedly, using modulo doesn't offer exactly uniform probability,[1][2] but this shouldn't bias much for this application because id_hash_int is expected to be very large and num_choices is assumed to be small.

Using random

The random module can be used with id_num as its seed, while addressing concerns surrounding both thread safety and continuity. Using randrange in this manner is comparable to and simpler than hashing the seed and taking modulo.

With this approach, not only is cross-language reproducibility a concern, but reproducibility across multiple future versions of Python could also be a concern. It is therefore not recommended.

import random

def id_to_choice(id_num, num_choices):
    localrandom = random.Random(id_num)
    choice = localrandom.randrange(num_choices)
    return choice

>>> id_to_choice(123, 3)
0
>>> id_to_choice(456, 3)
2
like image 121
Asclepius Avatar answered Nov 16 '22 09:11

Asclepius