
How to generate random numbers to satisfy a specific mean and median in python?

I would like to generate n random numbers e.g., n=200, where the range of possible values is between 2 and 40 with a mean of 12 and median is 6.5.

I searched everywhere and could not find a solution for this. I tried the following script, but it only works for small n such as 20; for larger n it takes ages and no result is returned.

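import numpy as np

n = 200
x = np.random.randint(0, 1, size=n)  # initialisation only (all zeros)
while True:
    if x.mean() == 12 and np.median(x) == 6.5:
        break
    else:
        # note: the high bound is exclusive, so this draws values in 2..39
        x = np.random.randint(2, 40, size=n)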

Could anyone help me by improving this to get a quick result even when n=5000 or so?

MWH asked Apr 16 '18 10:04



2 Answers

Here, you want a median lower than the mean. That means a uniform distribution is not appropriate: you want many small values and fewer large ones.

Specifically, you want as many values less than or equal to 6 as values greater than or equal to 7.

A simple way to ensure that the median is 6.5 is to have the same number of values in the range [2, 6] as in [7, 40] (with at least one 6 and one 7 among them, which is practically certain for large n). If you chose uniform distributions in both ranges, you would get a theoretical mean of 13.75, which is not that far from the required 12.
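The 13.75 figure can be checked in a few lines. This is only a sketch of the arithmetic; the variable names are chosen here for illustration:

```python
from statistics import mean

lower = range(2, 7)    # 2..6, drawn uniformly for half of the samples
upper = range(7, 41)   # 7..40, drawn uniformly for the other half

mean_lower = mean(lower)   # 4
mean_upper = mean(upper)   # 23.5

# Both halves contain the same number of samples, so the overall
# theoretical mean is the plain average of the two half means.
overall = (mean_lower + mean_upper) / 2
print(overall)  # 13.75
```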

A slight variation of the weights brings the theoretical mean even closer: if we use [5, 4, 3, 2, 1, 1, ..., 1] as the relative weights passed to random.choices for the [7, 8, ..., 40] range, we get a theoretical mean of 19.98 for that range, which is close enough to the 20 we need (the lower range averages 4, and (4 + 20) / 2 = 12).
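The weighted mean of the upper range can be verified the same way. A minimal sketch, assuming the weights apply to 7, 8, 9, 10 and then 1 for every value from 11 to 40:

```python
population = list(range(7, 41))       # 7..40, 34 values
weights = [5, 4, 3, 2] + [1] * 30     # relative weights, one per value

# Theoretical mean of a weighted discrete distribution:
# sum(value * weight) / sum(weight)
weighted_mean = sum(v * w for v, w in zip(population, weights)) / sum(weights)
print(round(weighted_mean, 2))  # 19.98

# Combined with the lower range (theoretical mean 4), half and half:
overall = (4 + weighted_mean) / 2
print(round(overall, 2))  # 11.99
```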

Example code:

>>> import random, statistics
>>> pop1 = list(range(2, 7))
>>> pop2 = list(range(7, 41))
>>> w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)
>>> r1 = random.choices(pop1, k=2500)
>>> r2 = random.choices(pop2, w2, k=2500)
>>> r = r1 + r2
>>> random.shuffle(r)
>>> statistics.mean(r)
12.0358
>>> statistics.median(r)
6.5
>>>

So we now have a distribution of 5000 values with a median of exactly 6.5 and a mean of 12.0358 (this one is random, and another run will give a slightly different value). If we want an exact mean of 12, we just have to tweak some values. Here sum(r) is 60179 when it should be 60000, so we have to decrease by one 179 values that are neither 2 (they would go out of range) nor 7 (that would change the median).

In the end, a possible generator function could be:

import random                              # random.choices needs Python 3.6+

def gendistrib(n):
    if n % 2 != 0 :
        raise ValueError("gendistrib needs an even parameter")
    n2 = n//2     # half of the values go in each range
    pop1 = list(range(2, 7))               # lower range
    pop2 = list(range(7, 41))              # upper range
    w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)      # weights for upper range
    r1 = random.choices(pop1, k=n2)        # lower part of the distrib.
    r2 = random.choices(pop2, w2, k=n2)    # upper part
    r = r1 + r2
    random.shuffle(r)                      # randomize order
    # time to force an exact mean
    tot = sum(r)
    expected = 12 * n
    if tot > expected:                     # too high: decrease some values
        for i, val in enumerate(r):
            if val != 2 and val != 7:
                r[i] = val - 1
                tot -= 1
                if tot == expected:
                    random.shuffle(r)      # shuffle again the decreased values
                    break
    elif tot < expected:                   # too low: increase some values
        for i, val in enumerate(r):
            if val != 6 and val != 40:
                r[i] = val + 1
                tot += 1
                if tot == expected:
                    random.shuffle(r)      # shuffle again the increased values
                    break
    return r

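It is really fast: timeit reports less than 0.02 seconds for gendistrib(10000). But it should not be used for small distributions (fewer than 1000 values): with few samples, the adjustment loop can change every 6 (or every 7) in the list and so move the median.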
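Because the exact median is only very likely for large n rather than guaranteed, it can be worth checking a generated list before using it. A minimal sketch of such a check; check_distribution is a hypothetical helper, not part of the answer's code:

```python
import statistics

def check_distribution(values, lo=2, hi=40, mean=12, median=6.5):
    """Return True if every constraint from the question holds."""
    return (
        all(lo <= v <= hi for v in values)        # range constraint
        and sum(values) == mean * len(values)     # exact mean, without float rounding
        and statistics.median(values) == median   # exact median
    )

print(check_distribution([2, 6, 7, 33]))  # True
print(check_distribution([2, 6, 7, 34]))  # False: the sum is 49, not 48
```

It could be called as check_distribution(gendistrib(5000)), retrying in the rare case where it returns False.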

Serge Ballesta answered Sep 29 '22 11:09


One way to get a result really close to what you want is to generate two separate random arrays of length 100 that satisfy your median constraint and cover the desired range of numbers. After concatenating the arrays, the mean will be around 12 but not exactly 12. But since it is only the mean that is off, you can get the expected result by tweaking one of these arrays.

In [162]: arr1 = np.random.randint(2, 7, 100)    
In [163]: arr2 = np.random.randint(7, 40, 100)

In [164]: np.mean(np.concatenate((arr1, arr2)))
Out[164]: 12.22

In [166]: np.median(np.concatenate((arr1, arr2)))
Out[166]: 6.5

The following is a vectorized solution that constrains how the random sequence is created, so it should be much faster than solutions that use for loops or other Python-level code:

import numpy as np
import math

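def gen_random():
    arr1 = np.random.randint(2, 7, 99)    # lower half: 2..6
    arr2 = np.random.randint(7, 40, 99)   # upper half: 7..39 (the high bound is exclusive)
    mid = [6, 7]                          # pins the median at 6.5
    # surplus over the target sum 12 * 200, per adjusted item (13 = 6 + 7 from mid)
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    decm, intg = math.modf(i)             # fractional and integer parts of i
    args = np.argsort(arr2)
    arr2[args[-41:-1]] -= int(intg)       # the 40 largest values below the maximum
    arr2[args[-1]] -= int(np.round(decm * 40))  # the maximum absorbs the remainder
    return np.concatenate((arr1, mid, arr2))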

Demo:

arr = gen_random()
print(np.median(arr))
print(arr.mean())

6.5
12.0

The logic behind the function:

To get a random array that meets these criteria, we can concatenate three arrays: arr1, mid and arr2. arr1 and arr2 each hold 99 items, and mid holds the two items 6 and 7, so that the final result has 6.5 as its median. Now we can create two random arrays, each of length 99. All we need to do to give the result a mean of 12 is to find the difference between the current sum and 12 * 200 and subtract it from our N largest numbers, which in this case we take from arr2, using N=40 (plus the single largest value, which absorbs the remainder of the division).
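The arithmetic of that adjustment can be followed on a concrete number. This is a sketch only: current_sum is a hypothetical value (the expected sum for these ranges), and it assumes the current sum is above the target:

```python
import math

target_sum = 12 * 200
current_sum = 2686            # hypothetical: sum(arr1) + sum(arr2) + 13

surplus = current_sum - target_sum      # 286 has to be removed in total
i = surplus / 40                        # 7.15 per adjusted item
decm, intg = math.modf(i)               # fractional part, integer part

per_item = int(intg)                    # subtracted from each of 40 items
remainder = int(round(decm * 40))       # subtracted from the largest item
print(per_item, remainder)              # 7 6

# The same split, using integer arithmetic only:
print(divmod(surplus, 40))              # (7, 6)
```

Since 40 * 7 + 6 = 286, the total removed equals the surplus and the mean lands on exactly 12.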

Edit:

If it's not a problem to have float numbers in your result, you can actually shorten the function as follows:

import numpy as np

def gen_random():
    # plain float instead of np.float, which was removed in NumPy 1.24
    arr1 = np.random.randint(2, 7, 99).astype(float)
    arr2 = np.random.randint(7, 40, 99).astype(float)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    args = np.argsort(arr2)
    arr2[args[-40:]] -= i     # spread the surplus evenly over the 40 largest values
    return np.concatenate((arr1, mid, arr2))

Mazdak answered Sep 29 '22 11:09