Birthday Paradox, incorrect output by about 1

Tags:

python

I'm relatively new to python and wanted to test myself, by tackling the birthday problem. Rather than calculating it mathematically, I wanted to simulate it, to see if I would get the right answer. So I assigned all boolean values in the list sieve[] as False and then randomly pick a value from 0 to 364 and change it to True, if it's already True then it outputs how many times it had to iterate as an answer.

For some reason, every time I run the code, I get a value between 24.5 and 24.8

The expected result for 50% is 23 people, so why is my result 6% higher than it should be? Is there an error in my code?

import random

def howManyPeople():
    sieve = [False] * 365
    count = 1
    while True:
        newBirthday = random.randint(0,364)
        if sieve[newBirthday]:
            return count
        else:
            sieve[newBirthday] = True
            count += 1

def multipleRun():
    global timesToRun
    results = []
    for i in range(timesToRun):
        results.append(howManyPeople())
    finalResultAverage = sum(results)
    return (finalResultAverage / timesToRun)

timesToRun = int(input("How many times would you like to run this code?"))

print("Average of all solutions = " + str(multipleRun()) + " people")

943

asked Sep 07 '18 17:09

Evan Lewis

2 Answers

There's no error in your code. You're computing the mean of your sample of howManyPeople return values, when what you're really interested in (and what the birthday paradox tells you about) is the median of the distribution.

That is, you've got a random process where you incrementally add people to a set, then report the total number of people in that set on the first birthday collision. The birthday paradox implies that at least 50% of the time, your set will have 23 or fewer people. That's not the same thing as saying the expected number of people in the set is 23.0 or smaller.

Here's what I see from one million samples of your howManyPeople function.

In [4]: sample = [howManyPeople() for _ in range(1000000)]

In [5]: import numpy as np

In [6]: np.median(sample)
Out[6]: 23.0

In [7]: np.mean(sample)
Out[7]: 24.617082

In [8]: np.mean([x <= 23 for x in sample])
Out[8]: 0.506978

Note that there's a (tiny) amount of luck here: the median of the distribution of howManyPeople return values is 23 (at least according to Wikipedia's definition), but there's a chance that an unusual sample could have different median, purely through randomness. In this particular case, that chance is entirely negligible. And as user2357112 points out in comments, things are a bit messier in the 2-day year example, where any real number between 2.0 and 3.0 (inclusive) is a valid distribution median, and we could reasonably expect a sample median to be either 2 or 3.

Instead of sampling, we can also compute the probabilities of each output of howManyPeople directly: for any positive integer k, the probability that the output is strictly larger than k is the same as the probability that the first k people have distinct birthdays, which is given (in Python syntax) by factorial(365)/factorial(k)/365**k, and we can use that to compute the probabilities of individual outputs. Here I'm using the name X for the random variable represented by howManyPeople. Some inefficient code:

from math import factorial

def prob_X_greater_than(k):
    """Probability that the output of howManyPeople is > k."""
    if k <= 0:
        return 1.0
    elif k > 365:
        return 0.0
    else:
        return factorial(365) / factorial(365 - k) / 365**k

def prob_X_equals(k):
    """Probability that the output of howManyPeople is == k."""
    return prob_x_greater_than(k-1) - prob_x_greater_than(k)

With this, we can get the exact (well, okay, exact up to numerical errors) mean and verify that it roughly matches what we got from the sample:

In [18]: sum(k*prob_x_equals(k) for k in range(1, 366))
Out[18]: 24.616585894598863

And the birthday paradox in this case should tell us that the sum of the probabilities for k <= 23 is greater than 0.5:

In [19]: sum(prob_x_equals(k) for k in range(1, 24))
Out[19]: 0.5072972343239854

answered Oct 03 '22 03:10

Mark Dickinson

What you're seeing is normal. There may be a >50% chance of having a duplicate birthday in a room of 23 random people (ignoring leap years and nonuniform birthday distributions), but that doesn't mean that if you add people to a room one by one, the mean point at which you get a duplicate will be 23.

To get an intuitive feel for this, imagine if years only had two days. In this case, it's clear that there's a 50% chance of having a duplicate birthday in a room with 2 people. However, if you add random people to the room one by one, you're going to need at least two people - 50% chance of stopping at 2 and 50% of 3. The mean stopping point is 2.5, not 2.

answered Oct 03 '22 02:10

user2357112 supports Monica

Related questions
                            
                                Python (Pandas) Add subtotal on each lvl of multiindex dataframe
                            
                                pip install pickle not working - no such file or directory
                            
                                expanding a dataframe based on start and end columns (speed)
                            
                                How to remove the quotes from a string for SQL query in Python?
                            
                                Convert column values to lower case only if they are string
                            
                                How to remove all the values in a string except for the chosen ones [duplicate]
                            
                                json.loads() doesn't keep order [duplicate]
                            
                                Check if module is running in Jupyter or not
                            
                                Is there a way to delete all cells at once in jupyter?
                            
                                Python download youtube with specific filename
                            
                                Mask from max values in numpy array, specific axis
                            
                                How to delete a global variable from inside a function?
                            
                                Why do I get an AttributeError when using pandas apply?
                            
                                How to make a command case insensitive in discord.py
                            
                                Duplicating training examples to handle class imbalance in a pandas data frame
                            
                                WebRTC Python implementation
                            
                                Does applying a Dropout Layer after the Embedding Layer have the same effect as applying the dropout through the LSTM dropout parameter?
                            
                                The fastest way to exclude surrounding zeros from an array representing an image?
                            
                                How to convert a datetime format to minutes - pandas
                            
                                Flask SQLAlchemy - set expire_on_commit=False only for current session

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With