I'm relatively new to python and wanted to test myself, by tackling the birthday problem. Rather than calculating it mathematically, I wanted to simulate it, to see if I would get the right answer. So I assigned all boolean values in the list sieve[] as False and then randomly pick a value from 0 to 364 and change it to True, if it's already True then it outputs how many times it had to iterate as an answer.
For some reason, every time I run the code, I get a value between 24.5 and 24.8
The expected result for 50% is 23 people, so why is my result 6% higher than it should be? Is there an error in my code?
import random
def howManyPeople():
sieve = [False] * 365
count = 1
while True:
newBirthday = random.randint(0,364)
if sieve[newBirthday]:
return count
else:
sieve[newBirthday] = True
count += 1
def multipleRun():
global timesToRun
results = []
for i in range(timesToRun):
results.append(howManyPeople())
finalResultAverage = sum(results)
return (finalResultAverage / timesToRun)
timesToRun = int(input("How many times would you like to run this code?"))
print("Average of all solutions = " + str(multipleRun()) + " people")
The birthday paradox is strange, counter-intuitive, and completely true. It's only a “paradox” because our brains can't handle the compounding power of exponents. We expect probabilities to be linear and only consider the scenarios we're involved in (both faulty assumptions, by the way).
Due to probability, sometimes an event is more likely to occur than we believe it to. In this case, if you survey a random group of just 23 people there is actually about a 50–50 chance that two of them will have the same birthday. This is known as the birthday paradox.
So the chance that two people don't share a birthday is (365×364)/365². Subtract that from 1 and you get what you expect: that there's a 1 in 365 chance that two people share a birthday.
The chance of: two people sharing a birthday would be 1 - (364/365), or 0.3%, or 1 in 370.
There's no error in your code. You're computing the mean of your sample of howManyPeople
return values, when what you're really interested in (and what the birthday paradox tells you about) is the median of the distribution.
That is, you've got a random process where you incrementally add people to a set, then report the total number of people in that set on the first birthday collision. The birthday paradox implies that at least 50% of the time, your set will have 23 or fewer people. That's not the same thing as saying the expected number of people in the set is 23.0 or smaller.
Here's what I see from one million samples of your howManyPeople
function.
In [4]: sample = [howManyPeople() for _ in range(1000000)]
In [5]: import numpy as np
In [6]: np.median(sample)
Out[6]: 23.0
In [7]: np.mean(sample)
Out[7]: 24.617082
In [8]: np.mean([x <= 23 for x in sample])
Out[8]: 0.506978
Note that there's a (tiny) amount of luck here: the median of the distribution of howManyPeople
return values is 23
(at least according to Wikipedia's definition), but there's a chance that an unusual sample could have different median, purely through randomness. In this particular case, that chance is entirely negligible. And as user2357112 points out in comments, things are a bit messier in the 2-day year example, where any real number between 2.0
and 3.0
(inclusive) is a valid distribution median, and we could reasonably expect a sample median to be either 2
or 3
.
Instead of sampling, we can also compute the probabilities of each output of howManyPeople
directly: for any positive integer k
, the probability that the output is strictly larger than k
is the same as the probability that the first k
people have distinct birthdays, which is given (in Python syntax) by factorial(365)/factorial(k)/365**k
, and we can use that to compute the probabilities of individual outputs. Here I'm using the name X
for the random variable represented by howManyPeople
. Some inefficient code:
from math import factorial
def prob_X_greater_than(k):
"""Probability that the output of howManyPeople is > k."""
if k <= 0:
return 1.0
elif k > 365:
return 0.0
else:
return factorial(365) / factorial(365 - k) / 365**k
def prob_X_equals(k):
"""Probability that the output of howManyPeople is == k."""
return prob_x_greater_than(k-1) - prob_x_greater_than(k)
With this, we can get the exact (well, okay, exact up to numerical errors) mean and verify that it roughly matches what we got from the sample:
In [18]: sum(k*prob_x_equals(k) for k in range(1, 366))
Out[18]: 24.616585894598863
And the birthday paradox in this case should tell us that the sum of the probabilities for k <= 23
is greater than 0.5
:
In [19]: sum(prob_x_equals(k) for k in range(1, 24))
Out[19]: 0.5072972343239854
What you're seeing is normal. There may be a >50% chance of having a duplicate birthday in a room of 23 random people (ignoring leap years and nonuniform birthday distributions), but that doesn't mean that if you add people to a room one by one, the mean point at which you get a duplicate will be 23.
To get an intuitive feel for this, imagine if years only had two days. In this case, it's clear that there's a 50% chance of having a duplicate birthday in a room with 2 people. However, if you add random people to the room one by one, you're going to need at least two people - 50% chance of stopping at 2 and 50% of 3. The mean stopping point is 2.5, not 2.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With