
Probability density function from a histogram in Python to fit another histogram

I have a question concerning fitting and getting random numbers.

The situation is as follows:

First, I have a histogram built from data points.

import numpy as np

# create random data points
mu = 10
sigma = 5
n = 1000

datapoints = np.random.normal(mu, sigma, n)

# create a normalized histogram of the data

bins = np.linspace(0, 20, 21)
H, bins = np.histogram(datapoints, bins, density=True)

I would like to interpret this histogram as a probability density function (with, e.g., 2 free parameters) so that I can use it to produce random numbers. I would also like to use that function to fit another histogram.

Thanks for your help

madzone asked Nov 20 '12 15:11


1 Answer

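You can use a cumulative distribution function (CDF) to generate random numbers from an arbitrary distribution: draw uniform random numbers between 0 and 1 and map them through the inverse of the CDF. This technique is known as inverse transform sampling.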
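As a minimal sketch of the idea, here is inverse transform sampling for a distribution whose CDF can be inverted by hand. The exponential distribution, the `rate` value and the fixed seed are assumptions chosen purely for illustration; they are not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

rate = 2.0  # hypothetical rate parameter of the target exponential distribution
u = rng.uniform(0.0, 1.0, size=100_000)  # uniform draws in [0, 1)

# Exponential CDF: F(x) = 1 - exp(-rate * x), so the inverse is F^-1(u) = -ln(1 - u) / rate
samples = -np.log1p(-u) / rate

# The sample mean should be close to the theoretical mean 1 / rate
print(samples.mean(), 1.0 / rate)
```

For a histogram there is no closed-form inverse, so the inverse CDF has to be built numerically, which is where interpolation comes in.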

Using a histogram to produce a smooth cumulative distribution function is not entirely trivial. You can interpolate, for example with scipy.interpolate.interp1d(), for values between the centers of your bins; that works fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), use any other form of tail appropriate to your problem, or simply truncate the distribution.

Example:

import numpy
import scipy.interpolate
import matplotlib.pyplot as pyplot

# create some normally distributed values and make a normalized histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
bin_widths = bins[1:] - bins[:-1]

# cumulative probability at the right edge of each bin
x = numpy.cumsum(counts * bin_widths)
y = bins[1:]

# interpolating edge positions against cumulative probability gives the inverse CDF
inverse_cdf = scipy.interpolate.interp1d(x, y)

# generate more values with the same distribution:
# draw uniform numbers inside the interpolation range and map them through the inverse CDF
u = numpy.random.uniform(x[0], x[-1], size=10000)
b = inverse_cdf(u)

# plot both
pyplot.hist(a, 100, alpha=0.5, label="original")
pyplot.hist(b, 100, alpha=0.5, label="resampled")
pyplot.legend()
pyplot.show()

This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
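One possible way to treat the bin edges more carefully is sketched below. It evaluates the CDF at every bin edge (starting from 0 at the left-most edge) and simply truncates the distribution to the histogram range, so the tails are still not modelled. The variable names, the seed, and the use of numpy.interp in place of interp1d are choices made for this sketch. The mean of 10 and standard deviation of 5 are borrowed from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

# Stand-in data set and its normalized histogram
data = rng.normal(loc=10.0, scale=5.0, size=10_000)
counts, edges = np.histogram(data, bins=100, density=True)

# CDF at every bin edge: 0 at the left-most edge, 1 at the right-most edge
cdf = np.concatenate(([0.0], np.cumsum(counts * np.diff(edges))))
cdf /= cdf[-1]  # guard against floating-point drift in the last value

# Inverse CDF by linear interpolation: swap the roles of CDF values and edges.
# Empty bins make the CDF flat in places; np.interp still returns a value there.
u = rng.uniform(0.0, 1.0, size=10_000)
new_samples = np.interp(u, cdf, edges)

print(data.mean(), new_samples.mean())
print(data.std(), new_samples.std())
```

Because the CDF is piecewise linear between edges, this is equivalent to sampling uniformly within each bin, with bins chosen in proportion to their probability. All generated values fall between the first and last bin edge.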

P.S. You could also fit a specific known distribution described by a few parameters (which I think is what you mentioned in the question), but the non-parametric approach above is more general-purpose.
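For the parametric route, a rough sketch could look like the following. It assumes the two free parameters are the mean and standard deviation of a Gaussian, and it uses scipy.optimize.curve_fit to fit the density to the bin centers. The helper gaussian_pdf, the variable names and the seed are made up for this illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible


def gaussian_pdf(x, mu, sigma):
    """Normalized Gaussian density with two free parameters."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


# Stand-in data set and its normalized histogram over the full data range
data = rng.normal(loc=10.0, scale=5.0, size=10_000)
counts, edges = np.histogram(data, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Fit the density to the histogram heights, starting from the sample moments
p0 = [data.mean(), data.std()]
(mu_fit, sigma_fit), cov = curve_fit(gaussian_pdf, centers, counts, p0=p0)

# The fitted parameters can then be used to draw new random numbers
new_samples = rng.normal(mu_fit, abs(sigma_fit), size=10_000)

print(mu_fit, sigma_fit)
```

The same gaussian_pdf can be passed to curve_fit again with the bin centers and heights of a second histogram. Note that if the bins cover only part of the data (as with the 0 to 20 range in the question), density=True normalizes over that range only, so an extra amplitude parameter may be needed.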

Alex I answered Oct 11 '22 07:10
