I have data with a variable z
that contains around 4000 values (from 0.0 to 1.0), for which the histogram looks like this.
Now I need to generate a random variable, call it random_z
which should replicate the above distribution.
What I have tried so far is to generate a normal distribution centered at 1.0 and then remove all the values above 1.0, to get a distribution that looks similar. I have been using numpy.random.normal,
but the problem is that I cannot restrict the range to 0.0 to 1.0: the defaults are mean = 0.0 and std dev = 1.0, and the normal distribution's support is unbounded in any case.
Is there another way to go about generating this distribution in Python?
One approach: simulate the integer part Y = [X] of your variable, which has a discrete distribution with probability equal to the probability of being in each interval (for example via the table method, a.k.a. the alias method), and then simply add an independent uniform [0,1] random number: X = Y + U.
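A minimal sketch of that composition idea, assuming the histogram has been reduced to per-bin probabilities (the probs values below are illustrative, not taken from the question's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-bin probabilities estimated from the histogram of z.
probs = np.array([0.1, 0.2, 0.4, 0.3])  # P(Y = 0), ..., P(Y = 3)

n = 10_000
# Draw the integer part Y from the discrete bin distribution...
y = rng.choice(len(probs), size=n, p=probs)
# ...add a uniform fractional part, then rescale so the bins tile [0, 1).
x = (y + rng.random(n)) / len(probs)
```

Each sample lands in bin y with the requested probability and is uniform within that bin, which reproduces the histogram's shape at the bin level.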
If you want to bootstrap, you could use random.choice() on your observed series.
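A minimal bootstrap sketch along those lines, using numpy's equivalent numpy.random.choice on a stand-in for the observed series (the sample data here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the observed series z (illustrative data, not the question's).
z = rng.random(4000) ** 3

# Bootstrap: resample with replacement from the observed values.
random_z = rng.choice(z, size=4000, replace=True)
```

This exactly reproduces the empirical distribution, but never produces values that were not in the original sample.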
Here I'll assume you'd like to smooth a bit more than that and you aren't concerned with generating new extreme values.
Use pandas.Series.quantile() and a uniform [0,1] random number generator, as follows.

Training
1. Store the observed sample in a pandas Series S.

Production
1. Generate a uniform random number u between 0.0 and 1.0 the usual way, e.g., random.random().
2. Return S.quantile(u).

If you'd rather use numpy than pandas, from a quick reading it looks like you can substitute numpy.percentile() in step 2.
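The training/production steps above can be sketched as follows (the cubed-uniform sample is just illustrative data standing in for the question's z values):

```python
import random
import pandas as pd

random.seed(42)

# Training: store the observed sample in a pandas Series.
S = pd.Series([random.random() ** 3 for _ in range(4000)])

# Production: one new draw per uniform random number.
u = random.random()        # uniform in [0, 1)
new_value = S.quantile(u)  # interpolated inverse CDF of the sample
```

Because quantile() interpolates between observed order statistics, the draws are smoothed versions of the sample rather than exact copies of observed values.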
Principle of Operation:
From the sample S, pandas.Series.quantile() or numpy.percentile() is used to evaluate the inverse cumulative distribution function, as in inverse transform sampling. The quantile or percentile function (relative to S) transforms a uniform [0,1] pseudo-random number into a pseudo-random number having the range and distribution of the sample S.
If you need to minimize coding and don't want to write and use a function that returns only a single realization, then it seems numpy.percentile bests pandas.Series.quantile.
Let S be a pre-existing sample, u the new uniform random numbers, and newR the new randoms drawn from an S-like distribution.
>>> import numpy as np
I need a sample of the kind of random numbers to be duplicated to put in S. For the purposes of creating an example, I am going to raise some uniform [0,1] random numbers to the third power and call that the sample S. By choosing to generate the example sample in this way, I know in advance -- from the mean being equal to the definite integral of (x^3)(dx) evaluated from 0 to 1 -- that the mean of S should be 1/(3+1) = 1/4 = 0.25.
In your application, you would need to do something else instead, perhaps read a file, to create a numpy array S containing the data sample whose distribution is to be duplicated.
>>> S = pow(np.random.random(1000),3) # S will be 1000 samples of a power distribution
Here I will check that the mean of S is 0.25 as stated above.
>>> S.mean()
0.25296623781420458 # OK
Get the min and max, just to show how np.percentile works.
>>> S.min()
6.1091277680105382e-10
>>> S.max()
0.99608676594692624
The numpy.percentile function maps 0-100 to the range of S.
>>> np.percentile(S,0) # this should match the min of S
6.1091277680105382e-10 # and it does
>>> np.percentile(S,100) # this should match the max of S
0.99608676594692624 # and it does
>>> np.percentile(S,[0,100]) # this should send back an array with both min, max
[6.1091277680105382e-10, 0.99608676594692624] # and it does
>>> np.percentile(S,np.array([0,100])) # but this doesn't....
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 2803, in percentile
if q == 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This isn't so convenient if we want to generate 100 new values at once. Starting with uniforms:
>>> u = np.random.random(100)
passing u straight to np.percentile would error out as above, and the scale is wrong anyway: u runs 0-1 while np.percentile expects 0-100.
This will work:
>>> newR = np.percentile(S, (100*u).tolist())
which works fine but might need its type adjusted if you want a numpy array back
>>> type(newR)
<type 'list'>
>>> newR = np.array(newR)
Now we have a numpy array. Let's check the mean of the new random values.
>>> newR.mean()
0.25549728059744525 # close enough
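For what it's worth, the array error and the .tolist() workaround above reflect an old NumPy; recent releases accept an array of percentiles directly and return an ndarray. A sketch assuming a reasonably current NumPy (the sample S here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.random(1000) ** 3   # stand-in sample, as in the example above
u = rng.random(100)         # uniform [0, 1) draws

# Modern NumPy accepts an array of percentiles (scaled to 0-100)
# and returns an ndarray, so no .tolist() conversion is needed.
newR = np.percentile(S, 100 * u)
```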
When using numpy.random.normal you can pass keyword arguments to set the mean and standard deviation of your returned array. These keyword arguments are loc (mean) and scale (standard deviation).
import numpy as np
import matplotlib.pyplot as plt
N = 4000
mean = 1.0
std = 0.5
x = []
while len(x) < N:
    y = np.random.normal(loc=mean, scale=std, size=1)[0]
    if 0.0 <= y <= 1.0:
        x.append(y)
plt.hist(x)
plt.show()
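The loop above is rejection sampling from a truncated normal. If SciPy is available, scipy.stats.truncnorm does the same thing without a loop; note that its a and b cut points are expressed in standard-deviation units relative to loc:

```python
import numpy as np
from scipy.stats import truncnorm

mean, std = 1.0, 0.5
low, high = 0.0, 1.0

# truncnorm's bounds are standardized: (bound - loc) / scale.
a, b = (low - mean) / std, (high - mean) / std

# 4000 draws from a normal(mean, std) truncated to [0, 1].
x = truncnorm.rvs(a, b, loc=mean, scale=std, size=4000, random_state=0)
```

This avoids discarding samples, which matters when the truncation region has low probability under the untruncated normal.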