Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining two samples out of numpy.random does not end up in a random sequence

Tags:

python

random

I implemented the Wald-Wolfowitz runs test but during testing I encountered weird behaviour, the steps I take are the following:

  1. I take two samples out of the same distribution:

    import numpy as np
    list_dist_A = np.random.chisquare(2, 1000)
    list_dist_B = np.random.chisquare(2, 1000)
    
  2. I concatenate the two lists and sort them, while remembering which number came from which sample. The following function does that and it returns a list of labels ["A","B","A","A", ... "B"]

    def _get_runs_list(list1, list2):
         # Add labels  
         l1 = list(map(lambda x: (x, "A"), list1))
         l2 = list(map(lambda x: (x, "B"), list2))
         # Concatenate
         lst = l1 + l2
         # Sort
         sorted_list = sorted(lst, key=lambda x: x[0])
         # Return only the labels:
         return [l[1] for l in sorted_list]
    
  3. Now I want to calculate the number of runs (a consecutive sequence of identical labels). e.g.:

    • a,b,a,b has 4 runs
    • a,a,a,b,b has 2 runs
    • a,b,b,b,a,a has 3 runs

    For this I use the following code:

    def _calculate_nruns(labels):
        nruns = 0
        last_seen = None
    
        for label in labels:
            if label != last_seen:
                nruns += 1
            last_seen = label
    
        return nruns
    

Since all elements are randomly drawn I thought that I should roughly end up with a sequence a,b,a,b,a,b... So this would mean that the number of runs is roughly 2000. However as can be seen in this snippet on "repl.it" this is not the case, it is always roughly around 1000.

Can anyone explain why this is the case?

like image 382
warreee Avatar asked Dec 28 '25 05:12

warreee


2 Answers

~1000 is the expected result. Following the Wikipedia article on this statistical test, you have Np = Nn = 1000 and N = Np + Nn = 2000. That means that the expected value for the number of runs is mu = 2 * Np * Nn / N + 1 which is 1001.

like image 162
Robert Kern Avatar answered Dec 30 '25 21:12

Robert Kern


Oh, that reminds me of the Gambler's fallacy.

I'm not a statistician but to get 2000 runs you would need a 100% chance that A follows B and B follows A. This would indicate that the PRNG has some sort of memory of previous draws. That wouldn't be good...

OTOH, assume you have drawn a value labelled A, then there's a 50% chance to draw another A and a 50% chance to draw a B. So the chance to draw a length-one-run is actually only 50%, the chance to get a length-two run is 25%, for length-three it's 12.5%, for length-four it's 6.25 and so on.

The last part can easily be verified:

import numpy as np
list_dist_A = np.random.chisquare(2, 1000)
list_dist_B = np.random.chisquare(2, 1000)

listA = [(value, 'A') for value in list_dist_A]
listB = [(value, 'B') for value in list_dist_B]
combined = sorted(listA+listB, key=lambda x: x[0])
combined = [x[1] for x in combined]

from itertools import groupby
from collections import Counter

runlengths = [len(list(it)) for _, it in groupby(combined)]  # lengths of the individual runs
print(Counter(runlengths))  # similar to a histogram
# Counter({1: 497, 2: 234, 3: 131, 4: 65, 5: 29, 6: 20, 7: 11, 8: 2, 10: 1, 14: 1})

So this is actually quite close to the expectation (which would be: 1: 500, 2: 250, 3: 125, 4:62, ... as mentioned above). If your assumption would be correct it would be more close to 1:2000, 2: 0, ...

like image 39
MSeifert Avatar answered Dec 30 '25 20:12

MSeifert



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!