How likely is it for a collision to occur with tokens generated using Python's secrets library (https://docs.python.org/3/library/secrets.html)?
There doesn't seem to be any mention of their uniqueness.
The purpose of the secrets module is to source secret data, i.e. information that cannot be predicted or reverse engineered.
The secrets module provides access to the most secure source of randomness that your operating system provides.
Typically, the OS will use several entropy sources to generate these bits, e.g. process IDs, thread IDs, mouse/keyboard input timings, CPU counters, system time, etc.
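For context, the module's main helpers are thin wrappers over that OS source. A minimal sketch of their use (the sizes here are arbitrary examples):

import secrets

raw = secrets.token_bytes(16)          # 16 random bytes straight from the OS CSPRNG
hex_token = secrets.token_hex(16)      # same entropy, hex-encoded (32 characters)
url_token = secrets.token_urlsafe(16)  # URL-safe base64 text, handy for links
key = secrets.randbelow(2**32)         # a uniformly random int in [0, 2**32)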
As long as the available entropy is sufficiently large for the number of bits being generated (and again, the OS uses many sources and accumulates entropy constantly), we should expect a uniform distribution over all possible values. So if you generate a 32 bit key, you should expect to see each of the 4294967296 values with similar frequency.
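You can sanity-check that uniformity with a quick tally. A sketch over single-byte values, where each of the 256 outcomes should appear close to draws/256 times:

import secrets
from collections import Counter

draws = 1_000_000
counts = Counter(secrets.randbelow(256) for _ in range(draws))

# With a uniform source, every value should appear about
# draws / 256 = 3906 times, up to ordinary sampling noise.
print(min(counts.values()), max(counts.values()))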
To estimate how long before we expect a collision, we are essentially looking at the Birthday problem. In general, if the values are uniformly distributed over n possibilities, we should expect the first collision after generating about sqrt(n) values (more precisely, the expected number of draws is roughly sqrt(pi*n/2), or about 1.25*sqrt(n)).
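Plugging real token sizes into that bound shows why collisions are only a practical concern for tiny tokens. A quick sketch (the byte sizes are illustrative; 32 bytes is the default the module currently uses when you omit the size argument):

import math

def expected_first_collision(bits: int) -> float:
    # Birthday-problem estimate: expected draws before the first
    # collision among n = 2**bits equally likely values.
    return math.sqrt(math.pi * 2**bits / 2)

for nbytes in (4, 16, 32):
    bits = nbytes * 8
    draws = expected_first_collision(bits)
    print(f'{nbytes}-byte token: first collision expected after ~2**{math.log2(draws):.1f} draws')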
You can verify this with a quick benchmark program:
import secrets

def find_collision(b):
    # Generate b-byte tokens until one repeats; return how many
    # unique tokens were seen before the first collision.
    tokens = set()
    while True:
        token = secrets.token_bytes(b)
        if token in tokens:
            return len(tokens)
        tokens.add(token)

b = 4
samples = 100
l = [find_collision(b) for _ in range(samples)]
avg = sum(l) / len(l)
print(f'n = {2**(b*8)}, sqrt(n) = {2**(b*8/2)}')
print(f'on average, with a {b} byte token, a collision occurs after {avg} values')
Running this prints something like:

n = 4294967296, sqrt(n) = 65536.0
on average, with a 4 byte token, a collision occurs after 75797.78 values
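That lines up with the theory: for n = 2**32, sqrt(pi*n/2) is about 82137, and with only 100 samples and the high variance of first-collision counts, an observed mean of ~75798 is well within the expected spread.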
secrets makes no claims about uniqueness, as its goal is to generate high-entropy, random bytes that should be impossible to predict or reverse engineer. Adding constraints to the module to prevent duplicates would inherently make its output more predictable. In addition, secrets feeds you these bytes in a stream under the assumption that you take what you need and use it how you like, so preventing duplicates doesn't make much sense as an upstream responsibility. This is in contrast to something like a UUID, which has a fixed width and is intended to be used as a portable identifier and recognized datatype.
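To make the contrast concrete, a short sketch comparing the two (uuid4 is also randomly generated, but has a fixed 128-bit width and a standard textual form):

import secrets
import uuid

identifier = uuid.uuid4()      # fixed-width, standardized, recognized datatype
print(identifier)              # canonical 36-character hyphenated form

token = secrets.token_hex(16)  # you choose the width; the bytes carry no structure
print(token)                   # 32 hex characters, just entropy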