I implemented the Wald-Wolfowitz runs test, but during testing I encountered some weird behaviour. The steps I take are the following:
I take two samples out of the same distribution:
import numpy as np
list_dist_A = np.random.chisquare(2, 1000)
list_dist_B = np.random.chisquare(2, 1000)
I concatenate the two lists and sort them, while remembering which number came from which sample. The following function does that and it returns a list of labels ["A","B","A","A", ... "B"]
def _get_runs_list(list1, list2):
    # Add labels
    l1 = list(map(lambda x: (x, "A"), list1))
    l2 = list(map(lambda x: (x, "B"), list2))
    # Concatenate
    lst = l1 + l2
    # Sort by value
    sorted_list = sorted(lst, key=lambda x: x[0])
    # Return only the labels
    return [l[1] for l in sorted_list]
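As an aside, the same merge-and-label step can be written in vectorized form with NumPy; `runs_labels` is my own name for this sketch, not part of the original code:

```python
import numpy as np

def runs_labels(a, b):
    # Concatenate both samples, then read off the sample label
    # of each value in sorted order.
    values = np.concatenate([np.asarray(a), np.asarray(b)])
    labels = np.array(["A"] * len(a) + ["B"] * len(b))
    return labels[np.argsort(values, kind="stable")].tolist()

print(runs_labels([1.0, 3.0], [2.0, 4.0]))  # ['A', 'B', 'A', 'B']
```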
Now I want to calculate the number of runs (a run is a maximal consecutive sequence of identical labels), e.g. ["A","A","B","A","B","B"] contains 4 runs. For this I use the following code:
def _calculate_nruns(labels):
    nruns = 0
    last_seen = None
    for label in labels:
        # A run starts whenever the label changes
        if label != last_seen:
            nruns += 1
        last_seen = label
    return nruns
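For what it's worth, the same count can be cross-checked with `itertools.groupby`, which yields one group per run (`count_runs` is just an illustrative name):

```python
from itertools import groupby

def count_runs(labels):
    # groupby collapses each maximal run of identical labels into one group,
    # so the number of groups equals the number of runs.
    return sum(1 for _ in groupby(labels))

print(count_runs(["A", "A", "B", "A", "B", "B"]))  # 4
```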
Since all elements are randomly drawn, I thought I should roughly end up with an alternating sequence A, B, A, B, A, B, ..., which would mean the number of runs is roughly 2000.
However, as can be seen in this snippet on repl.it, this is not the case: it is always roughly around 1000.
Can anyone explain why this is the case?
~1000 is the expected result. Following the Wikipedia article on this statistical test, you have Np = Nn = 1000 and N = Np + Nn = 2000. That means the expected number of runs is mu = 2 * Np * Nn / N + 1, which is 1001.
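The mean (and, from the same Wikipedia article, the variance) of the number of runs under the null hypothesis can be computed directly; the function name here is my own:

```python
def runs_mean_variance(n_p, n_n):
    # Mean and variance of the number of runs under H0,
    # per the formulas in the Wikipedia article on the runs test.
    n = n_p + n_n
    mu = 2 * n_p * n_n / n + 1
    var = 2 * n_p * n_n * (2 * n_p * n_n - n) / (n ** 2 * (n - 1))
    return mu, var

mu, var = runs_mean_variance(1000, 1000)
print(mu)   # 1001.0
print(var)  # ~499.75
```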
Oh, that reminds me of the Gambler's fallacy.
I'm not a statistician, but to get 2000 runs you would need a 100% chance that an A follows a B and a B follows an A. That would mean the PRNG has some sort of memory of previous draws, which wouldn't be good...
OTOH, assume you have just drawn a value labelled A; then there's a 50% chance to draw another A and a 50% chance to draw a B. So the chance of a length-one run is actually only 50%, the chance of a length-two run is 25%, for length three it's 12.5%, for length four 6.25%, and so on.
The last part can easily be verified:
import numpy as np
list_dist_A = np.random.chisquare(2, 1000)
list_dist_B = np.random.chisquare(2, 1000)
listA = [(value, 'A') for value in list_dist_A]
listB = [(value, 'B') for value in list_dist_B]
combined = sorted(listA+listB, key=lambda x: x[0])
combined = [x[1] for x in combined]
from itertools import groupby
from collections import Counter
runlengths = [len(list(it)) for _, it in groupby(combined)] # lengths of the individual runs
print(Counter(runlengths)) # similar to a histogram
# Counter({1: 497, 2: 234, 3: 131, 4: 65, 5: 29, 6: 20, 7: 11, 8: 2, 10: 1, 14: 1})
So this is actually quite close to the expectation (which would be roughly 1: 500, 2: 250, 3: 125, 4: 62, ... as mentioned above). If your assumption were correct, it would be closer to 1: 2000, 2: 0, ...