
Python: Generate random values from empirical distribution

In Java, I usually rely on the org.apache.commons.math3.random.EmpiricalDistribution class to do the following:

  • Derive a probability distribution from observed data.
  • Generate random values from this distribution.

Is there any Python library that provides the same functionality? It seems like scipy.stats.gaussian_kde.resample does something similar, but I'm not sure if it implements the same procedure as the Java type I'm familiar with.

asked Feb 16 '16 by Carlos Gavidia-Calderon

People also ask

How do we simulate random numbers from empirical discrete distributions?

Random values are generated by applying the probability integral transform to the empirical CDF using a uniformly distributed random variable U on the interval [0, 1]. If U corresponds to the CDF probability of a particular empirical observation, that observation is selected.
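As a rough illustration (not from the original page; the helper name sample_empirical is made up), the same inverse-transform idea can be written in a few lines of numpy:

import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(size=1000)  # stands in for the observed data

def sample_empirical(data, n):
    # Inverse-transform sampling from the empirical CDF of `data`:
    # draw U ~ Uniform[0, 1] and return the smallest order statistic
    # whose empirical CDF value is >= U.
    sorted_data = np.sort(data)
    cdf = np.arange(1, sorted_data.size + 1) / sorted_data.size
    u = rng.uniform(0.0, 1.0, size=n)
    idx = np.searchsorted(cdf, u)
    return sorted_data[np.minimum(idx, sorted_data.size - 1)]

new_values = sample_empirical(observed, 10)

For a plain empirical CDF this is equivalent to resampling the observations with replacement (e.g. rng.choice(data, size=n)).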

How do you simulate an empirical distribution?

Find the random number on the vertical (CDF) axis and move right until you intersect the distribution curve, then go down to the horizontal axis and record the value there; that value is the sample. For example, with a random number of 0.44, the sample is whatever x-value the curve takes at a cumulative probability of 0.44.
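Numerically, this graphical lookup amounts to inverting the empirical CDF; a minimal sketch (assuming linear interpolation between observed points, which is an extra assumption beyond the description above) is:

import numpy as np

rng = np.random.default_rng(0)
data = np.sort(rng.normal(size=1000))          # observed sample, sorted
cdf = np.arange(1, data.size + 1) / data.size  # empirical CDF values

# "Enter" at u on the vertical axis and read the sample off the
# horizontal axis; np.interp performs the same lookup numerically.
u = 0.44
sample = np.interp(u, cdf, data)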

How do you find the empirical distribution in Python?

The EDF is calculated by ordering the unique observations in the data sample and computing, for each one, the cumulative probability: the number of observations less than or equal to that value divided by the total number of observations. That is: EDF(x) = (number of observations <= x) / n.
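A direct numpy translation of that definition (a sketch, not a library API; the function name edf is made up) could be:

import numpy as np

def edf(sample, x):
    # EDF(x) = (number of observations <= x) / n
    sample = np.asarray(sample)
    return np.count_nonzero(sample <= x) / sample.size

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(edf(data, 0.0))  # roughly 0.5 for a standard normal sample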


1 Answer

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

# This represents the original "empirical" sample -- I fake it by
# sampling from a normal distribution
orig_sample_data = np.random.normal(size=10000)

# Generate a KDE from the empirical sample
sample_pdf = scipy.stats.gaussian_kde(orig_sample_data)

# Sample new datapoints from the KDE
new_sample_data = sample_pdf.resample(10000).T[:,0]

# Histogram of initial empirical sample
cnts, bins, p = plt.hist(orig_sample_data, label='original sample', bins=100,
                         histtype='step', linewidth=1.5, density=True)

# Histogram of datapoints sampled from KDE
plt.hist(new_sample_data, label='sample from KDE', bins=bins,
         histtype='step', linewidth=1.5, density=True)

# Visualize the kde itself
y_kde = sample_pdf(bins)
plt.plot(bins, y_kde, label='KDE')
plt.legend()
plt.show(block=False)

[resulting plot: overlaid histograms of the original sample and the KDE resample, with the KDE curve]

new_sample_data should be drawn from roughly the same distribution as the original data (to the degree that the KDE is a good approximation to the original distribution).
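One way to check that claim (this check is not part of the original answer) is a two-sample Kolmogorov-Smirnov test between the original and resampled data; note that the KDE's smoothing bandwidth means the match is only approximate:

import scipy.stats

# Large p-value: no evidence the two samples come from different
# distributions; small statistic: the empirical CDFs are close.
stat, pvalue = scipy.stats.ks_2samp(orig_sample_data, new_sample_data)
print(f"KS statistic = {stat:.4f}, p-value = {pvalue:.4f}")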

answered Sep 30 '22 by abeboparebop