How to generate random numbers to satisfy a specific mean and median in python?

Tags:

I would like to generate n random numbers e.g., n=200, where the range of possible values is between 2 and 40 with a mean of 12 and median is 6.5.

I searched everywhere and i could not find a solution for this. I tried the following script by it works for small numbers such as 20, for big numbers it takes ages and result is returned.

n=200
x = np.random.randint(0,1,size=n) # initalisation only
while True:
        if x.mean() == 12 and np.median(x) == 6.5:
            break
        else:
            x=np.random.randint(2,40,size=n)

Could anyone help me by improving this to get a quick result even when n=5000 or so?

808

asked Apr 16 '18 10:04

MWH

2 Answers

Here, you want a median value lesser than the mean value. That means that a uniform distribution is not appropriate: you want many little values and fewer great ones.

Specifically, you want as many value lesser or equal to 6 as the number of values greater or equal to 7.

A simple way to ensure that the median will be 6.5 is to have the same number of values in the range [ 2 - 6 ] as in [ 7 - 40 ]. If you choosed uniform distributions in both ranges, you would have a theorical mean of 13.75, which is not that far from the required 12.

A slight variation on the weights can make the theorical mean even closer: if we use [ 5, 4, 3, 2, 1, 1, ..., 1 ] for the relative weights of the random.choices of the [ 7, 8, ..., 40 ] range, we find a theorical mean of 19.98 for that range, which is close enough to the expected 20.

Example code:

>>> pop1 = list(range(2, 7))
>>> pop2 = list(range(7, 41))
>>> w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)
>>> r1 = random.choices(pop1, k=2500)
>>> r2 = random.choices(pop2, w2, k=2500)
>>> r = r1 + r2
>>> random.shuffle(r)
>>> statistics.mean(r)
12.0358
>>> statistics.median(r)
6.5
>>>

So we now have a 5000 values distribution that has a median of exactly 6.5 and a mean value of 12.0358 (this one is random, and another test will give a slightly different value). If we want an exact mean of 12, we just have to tweak some values. Here sum(r) is 60179 when it should be 60000, so we have to decrease 175 values which were neither 2 (would go out of range) not 7 (would change the median).

In the end, a possible generator function could be:

def gendistrib(n):
    if n % 2 != 0 :
        raise ValueError("gendistrib needs an even parameter")
    n2 = n//2     # n / 2 in Python 2
    pop1 = list(range(2, 7))               # lower range
    pop2 = list(range(7, 41))              # upper range
    w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)      # weights for upper range
    r1 = random.choices(pop1, k=n2)        # lower part of the distrib.
    r2 = random.choices(pop2, w2, k=n2)    # upper part
    r = r1 + r2
    random.shuffle(r)                      # randomize order
    # time to force an exact mean
    tot = sum(r)
    expected = 12 * n
    if tot > expected:                     # too high: decrease some values
        for i, val in enumerate(r):
            if val != 2 and val != 7:
                r[i] = val - 1
                tot -= 1
                if tot == expected:
                    random.shuffle(r)      # shuffle again the decreased values
                    break
    elif tot < expected:                   # too low: increase some values
        for i, val in enumerate(r):
            if val != 6 and val != 40:
                r[i] = val + 1
                tot += 1
                if tot == expected:
                    random.shuffle(r)      # shuffle again the increased values
                    break
    return r

It is really fast: I could timeit gendistrib(10000) at less than 0.02 seconds. But it should not be used for small distributions (less than 1000)

answered Sep 29 '22 11:09

Serge Ballesta

One way to get a result really close to what you want is to generate two separate random ranges with length 100 that satisfies your median constraints and includes all the desire range of numbers. Then by concatenating the arrays the mean will be around 12 but not quite equal to 12. But since it's just mean that you're dealing with you can simply generate your expected result by tweaking one of these arrays.

In [162]: arr1 = np.random.randint(2, 7, 100)    
In [163]: arr2 = np.random.randint(7, 40, 100)

In [164]: np.mean(np.concatenate((arr1, arr2)))
Out[164]: 12.22

In [166]: np.median(np.concatenate((arr1, arr2)))
Out[166]: 6.5

Following is a vectorized and very much optimized solution against any other solution that uses for loops or python-level code by constraining the random sequence creation:

import numpy as np
import math

def gen_random(): 
    arr1 = np.random.randint(2, 7, 99)
    arr2 = np.random.randint(7, 40, 99)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    decm, intg = math.modf(i)
    args = np.argsort(arr2)
    arr2[args[-41:-1]] -= int(intg)
    arr2[args[-1]] -= int(np.round(decm * 40))
    return np.concatenate((arr1, mid, arr2))

Demo:

arr = gen_random()
print(np.median(arr))
print(arr.mean())

6.5
12.0

The logic behind the function:

In order for us to have a random array with that criteria we can concatenate 3 arrays together arr1, mid and arr2. arr1 and arr2 each hold 99 items and the mid holds 2 items 6 and 7 so that make the final result to give as 6.5 as the median. Now we an create two random arrays each with length 99. All we need to do to make the result to have a 12 mean is to find the difference between the current sum and 12 * 200 and subtract the result from our N largest numbers which in this case we can choose them from arr2 and use N=50.

Edit:

If it's not a problem to have float numbers in your result you can actually shorten the function as following:

import numpy as np
import math

def gen_random(): 
    arr1 = np.random.randint(2, 7, 99).astype(np.float)
    arr2 = np.random.randint(7, 40, 99).astype(np.float)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    args = np.argsort(arr2)
    arr2[args[-40:]] -= i
    return np.concatenate((arr1, mid, arr2))

answered Sep 29 '22 11:09

Mazdak

Related questions
                            
                                Markdown in Spyder IDE
                            
                                How to update OpenSSL on mac?
                            
                                Create an if-else condition column in dask dataframe
                            
                                people.connections.list not returning contacts using Python Client Library
                            
                                Python script execution time increases when executed multiple time parallely
                            
                                Keras: Use the same layer in different models (share weights)
                            
                                What is the parameter "index" in Pandas.DataFrame.rename method?
                            
                                Understanding memory behavior of Dask distributed
                            
                                How to determine the uncertainty of fit parameters with Python?
                            
                                Possible values for platform.machine()
                            
                                Flask - store object directly in a session [duplicate]
                            
                                Connecting to Amazon Aurora using SQLAlchemy
                            
                                metaclass and __prepare__ ()
                            
                                Flask's built-in server always 404 with SERVER_NAME set
                            
                                Python Abstract class with concrete methods
                            
                                python - how to docstring kwargs and their expected types
                            
                                How to bulk write TFRecords?
                            
                                Render current status only on template in StreamingHttpResponse in Django
                            
                                Django OneToOneField default value
                            
                                Call an async function in an normal function

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to generate random numbers to satisfy a specific mean and median in python?

Tags:

python

random

numpy

normal-distribution

MWH

People also ask

2 Answers

Serge Ballesta

Mazdak

Recent Activity

Donate For Us