Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python random.randint vs random.choice: different outcomes usingsame values

Tags:

python

random

I asked my students to write a python program in which the results of 100 rolls of a pair of 6-sided dice were stored in a list and then plotted in a histogram.

I was treating random.choice(1,2,3,4,5,6) as inferior to random.randint(1,6), until I noticed that the histograms of the students who used random.choice better reflected expected outcomes. For example, the occurrence of rolls of 12 (6+6) was unnaturally high in nearly all histograms of students who used random.randint(1,6). Does anyone have an idea of what's going on?

like image 492
N. Duncan Avatar asked Mar 12 '18 20:03

N. Duncan


2 Answers

From the documentation:

Almost all module functions depend on the basic function random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1. The underlying implementation in C is both fast and threadsafe. The Mersenne Twister is one of the most extensively tested random number generators in existence. However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.

So there shouldn't be any real difference in results. However, I would disagree that random.choice() is inferior to randint(), in fact, random choice is actually faster at generating random numbers. When you look at the source code:

def randint(self, a, b):
    return self.randrange(a, b+1)

def randrange(self, start, stop=None, step=1, _int=int, _maxwidth=1L<<BPF):
    istart = _int(start)
    if istart != start:
        # not executed
    if stop is None:
        # not executed

    istop = _int(stop)
    if istop != stop:
        # not executed
    width = istop - istart
    if step == 1 and width > 0:
        if width >= _maxwidth:
            # not executed
        return _int(istart + _int(self.random()*width))

And for choice():

def choice(self, seq):
    return seq[int(self.random() * len(seq))]

You can see that randint() has the additional overhead of using randrange()

EDIT As @abarnert has noted in the comments, there really is almost no performance difference here, and randint(1,6) is a clear and intuitive way of representing a dice roll

I ran both for 10000 rolls, and didn't see any skewing, so there is a chance your input samples were just too small:

enter image description here

And here is a distribution for rolling one dice twice, it is also very uniform:

enter image description here

I borrowed pieces of this from these two helpful answers: Performance of choice vs randint Is Pythons random.randint statistically random?, which are helpful for further reading.

like image 92
user3483203 Avatar answered Oct 31 '22 04:10

user3483203


You are correct that the number of 12s you have been observing in your students' histograms is higher than the theoretical probability of rolling a 12, but not for the reason that you think.

An experiment:

import random

def roll_dice(method):
    if method == "choice":
        return random.choice([1,2,3,4,5,6]) + random.choice([1,2,3,4,5,6])
    else:
        return random.randint(1,6) + random.randint(1,6)

def est_prob(n,k,method):
    rolls = [roll_dice(method) for _ in range(k)]
    return rolls.count(n)/k

def test12(n,k,method):
    return sum(1 if est_prob(12,n,method) > 1/36 else 0 for _ in range(k))/k

Note that test12(100,10000,"randint") estimates the probability that a histogram of 100 dice rolls based on randint over-represents the sum of 12.

Typical run:

>>> test12(100,10000,"randint")
0.5288

This is greater than 50% by a statistically significant amount (10000 trials is a fairly large number of trials to estimate a probability).

So evidence of bias in randint(), no? Not so fast:

>>> test12(100,10000,"choice")
0.5342

With random.choice() you see the same thing. None of this is surprising since most dice roll histograms based on 100 rolls overestimate the probability of 12.

When you roll a pair of dice 100 times, the expected number of rolls which sum to 12 is 100/36 = 2.78. But -- you can only ever observe an integer number of 12s. The probability that the observed number of 12s is 3 or above (and hence leads to a histogram which over-represents 12) is P(X >= 3) where X is a binomial random variable with parameters p = 1/36 and n = 100. This probability can be worked out to be

P(X >= 3) = 1 - P(X<=2) 
          = 1 - P(0) - P(1) - P(2)
          = 1 - 0.0598 - 0.1708 - 0.2416
          = 0.5278

Thus around 53% of such histograms have "too many" 12s, something which you will see with both random.choice() and random.randint().

It seems that you noticed this phenomenon more in the context of randint, interpreted it as bias (even though it isn't), and hypothesized that it was a deficiency in randint.

like image 30
John Coleman Avatar answered Oct 31 '22 05:10

John Coleman