Making Probability Distribution Functions (PDFs) from histograms

Question

Say I have several histograms, each with counts at different bin locations (on a real axis). e.g.

def generate_random_histogram():

    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()

    # Random counts between 0 and 50 on those locations 
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count':bin_counts}

# We can assume that the bin size is either pre-defined or that 
# the bin edges are on the middle-point between consecutive counts.
hists = [generate_random_histogram() for x in xrange(3)]

How can I normalize these histograms so that I get PDFs where the integral of each PDF adds up to one within a given range (e.g. 0 and 100)?

We can assume that the histogram counts events on pre-defined bin size (e.g. 10)

Most implementations I have seen are based, for example, on Gaussian Kernels (see scipy and scikit-learn) that start from the data. In my case, I need to do this from the histograms, since I don't have access to the original data.

Update:

Note that all current answers assume that we are looking at a random variable that lives in (-Inf, +Inf). That's fine as a rough approximation, but this may not be the case depending on the application, where the variable may be defined within some other range [a,b] (e.g. 0 and 100 in the above case)

askewchan · Accepted Answer

The main point of delicacy is defining bin_edges since technically they could be anywhere. I chose the midpoint between each pair of bin centers. Probably there are other ways to do this, but here is one:

hists = [generate_random_histogram() for x in xrange(3)]
for h in hists:
    bin_locations = h['loc']
    bin_counts = h['count']
    bin_edges = np.concatenate([[0], (bin_locations[1:] + bin_locations[:-1])/2, [100]])
    bin_widths = np.diff(bin_edges)
    bin_density = bin_counts.astype(float) / np.dot(bin_widths, bin_counts)
    h['density'] = bin_density

    data = np.repeat(bin_locations, bin_counts)
    h['kde'] = gaussian_kde(data)

    plt.step(bin_locations, bin_density, where='mid', label='normalized')
    plt.plot(np.linspace(0,100), h['kde'](np.linspace(0,100)), label='kde')

Which would result in plots like the following (one for each histogram): hists

jcrudy · Answer

Here is one possibility. I'm not that crazy about it, but it kind of works. Note that the histograms are kind of weird, as the bin width is quite variable.

import numpy as np
from matplotlib import pyplot
from sklearn.mixture.gmm import GMM
from sklearn.grid_search import GridSearchCV

def generate_random_histogram():
    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()

    # Random counts on those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count':bin_counts}

def bin_widths(loc):
    widths = []
    for i in range(len(loc)-1):
        widths.append(loc[i+1] - loc[i])
    widths.append(widths[-1])
    widths = np.array(widths)
    return widths

def sample_from_hist(loc, count, size):
    n = len(loc)
    tot = np.sum(count)
    widths = bin_widths(loc)
    lowers = loc - widths
    uppers = loc + widths
    probs = count / float(tot)
    bins = np.argmax(np.random.multinomial(n=1, pvals=probs, size=(size,)),1)
    return np.random.uniform(lowers[bins], uppers[bins])

# Generate the histogram
hist = generate_random_histogram()

# Sample from the histogram
sample = sample_from_hist(hist['loc'],hist['count'],np.sum(hist['count']))

# Fit a GMM
param_grid = [{'n_components':[1,2,3,4,5]}]
def scorer(est, X, y=None):
    return np.sum(est.score(X))
grid_search = GridSearchCV(GMM(), param_grid, scoring=scorer).fit(np.reshape(sample,(len(sample),1)))
gmm = grid_search.best_estimator_

# Sample from the GMM
gmm_sample = gmm.sample(np.sum(hist['count']))

# Plot the resulting histograms and mixture distribution
fig = pyplot.figure()
ax1 = fig.add_subplot(311)
widths = bin_widths(hist['loc'])
ax1.bar(hist['loc'], hist['count'], width=widths)
ax2 = fig.add_subplot(312, sharex=ax1)
ax2.hist(gmm_sample, bins=hist['loc'])
x = np.arange(min(sample), max(sample), .1)
y = np.exp(gmm.score(x))
ax3 = fig.add_subplot(313, sharex=ax1)
ax3.plot(x, y)
pyplot.show()

resulting plot

Making Probability Distribution Functions (PDFs) from histograms

Tags:

python

scipy

scikit-learn

statsmodels

pymc

Update:

Amelio Vazquez-Reina

2 Answers

askewchan

jcrudy

Recent Activity

Donate For Us

Making Probability Distribution Functions (PDFs) from histograms

Tags:

python

scipy

scikit-learn

statsmodels

pymc

Update:

Amelio Vazquez-Reina

2 Answers

askewchan

jcrudy

Related questions

Recent Activity

Donate For Us