Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How likely is token collision with Python secrets library?

How likely is it for a collision to occur with tokens generated using Python's secrets library (https://docs.python.org/3/library/secrets.html)?

There doesn't seem to be any mention of their uniqueness.

like image 837
daka Avatar asked Jan 25 '23 06:01

daka


1 Answers

The purpose of the secrets module, is for sourcing secret data, i.e. information that cannot be predicted or reverse engineered

The secrets module provides access to the most secure source of randomness that your operating system provides.

Typically, the OS will use several entropy sources to generate bits. e.g. process id, thread id, mouse / keyboard input timings, CPU counters, system time, etc.

So long as the amount of entropy is sufficiently large for the amount of bits being generated (and again the OS uses many sources and accumulates entropy constantly), we should expect a uniform distribution of all values. So if you generate a 32 bit key, you should expect to see each of the 4294967296 values with similar frequency.

To estimate how long before we expect a collision, we are essentially looking at the Birthday problem. In general, if the values are uniformly distributed, and the number of values is n, we should expect a collision to occur after generating about sqrt(n) values (though in reality it's a bit more).

You can verify this with a quick benchmark program

import secrets

def find_collision(b):
    tokens = set()
    while True:
        token = secrets.token_bytes(b)
        if token in tokens:
            return len(tokens)
        tokens.add(token)

b = 4
samples = 100
l = [find_collision(b) for i in range(samples)]
avg = sum(l)/len(l)

print(f'n = {2**(b*8)}, sqrt(n) = {2**(b*8/2)}')
print(f'on average, with a {b} byte token, a collision occurs after {avg} values')
n = 4294967296, sqrt(n) = 65536.0
on average, with a 4 byte token, a collision occurs after 75797.78 values

secrets makes no aims or claims about uniqueness, as the goal is to generate high entropy, random bytes that should be impossible to predict or reverse engineer. To add constraints to the module to prevent duplicates would inherently make it more predictable. In addition, secrets feeds you these bytes in a stream under the assumption that you take what you need, and use it how you like, so preventing duplicates doesn't make much sense as an upstream responsibility. This is in contrast to something like a UUID, which has a fixed width and is intended to be used as a portable identifier and recognized datatype.

like image 142
Hymns For Disco Avatar answered Jan 27 '23 20:01

Hymns For Disco