Say I have several histograms, each with counts at different bin locations (on a real axis). e.g.
import numpy as np

def generate_random_histogram():
    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()
    # Random counts between 0 and 50 at those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count': bin_counts}

# We can assume that the bin size is either pre-defined or that
# the bin edges sit at the midpoint between consecutive bin locations.
hists = [generate_random_histogram() for x in range(3)]
How can I normalize these histograms so that I get PDFs, i.e. so that each one integrates to one over a given range (e.g. [0, 100])?
We can assume that the histogram counts events on bins of a pre-defined size (e.g. 10).
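To make the goal concrete, here is a minimal sketch (with made-up counts) of the property I'm after, in the simple case where every bin has the same pre-defined width:

import numpy as np

counts = np.array([3, 7, 12, 5])               # made-up counts in four equal-width bins
bin_width = 10.0                               # the pre-defined bin size
density = counts / (counts.sum() * bin_width)  # rescale counts to a density
print(np.sum(density * bin_width))             # 1.0, i.e. the density integrates to one

The open question is how to do this when the bin edges are not given explicitly, and how to get a smooth PDF out of it.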
Most implementations I have seen are based, for example, on Gaussian kernels (see scipy and scikit-learn) that start from the data. In my case, I need to do this from the histograms, since I don't have access to the original data.
Note that all current answers assume that we are looking at a random variable that lives in (-Inf, +Inf). That is fine as a rough approximation, but depending on the application the variable may be defined within some other range [a, b] (e.g. 0 and 100 in the above case).
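If the bounded support matters, one rough way of handling it would be to renormalize an unbounded estimate by the mass it places inside [a, b]. The sketch below assumes a scipy gaussian_kde and the [0, 100] range from above; the helper name truncated_pdf is made up for illustration:

import numpy as np
from scipy.stats import gaussian_kde

def truncated_pdf(kde, a=0.0, b=100.0):
    # Mass the unbounded KDE places inside [a, b]
    mass = kde.integrate_box_1d(a, b)
    def pdf(x):
        x = np.asarray(x, dtype=float)
        inside = (x >= a) & (x <= b)
        # Zero outside [a, b], rescaled inside so the total integral is one
        return np.where(inside, kde(x) / mass, 0.0)
    return pdf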
The main point of delicacy is defining bin_edges, since technically they could be anywhere. I chose the midpoint between each pair of bin centers. Probably there are other ways to do this, but here is one:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

hists = [generate_random_histogram() for x in range(3)]
for h in hists:
    bin_locations = h['loc']
    bin_counts = h['count']
    # Edges: the domain boundaries plus the midpoints between bin centers
    bin_edges = np.concatenate([[0], (bin_locations[1:] + bin_locations[:-1]) / 2, [100]])
    bin_widths = np.diff(bin_edges)
    # Normalize so that sum(density * width) == 1 over [0, 100]
    bin_density = bin_counts.astype(float) / np.dot(bin_widths, bin_counts)
    h['density'] = bin_density
    # Fit a Gaussian KDE to pseudo-data reconstructed from the counts
    data = np.repeat(bin_locations, bin_counts)
    h['kde'] = gaussian_kde(data)
    plt.figure()
    plt.step(bin_locations, bin_density, where='mid', label='normalized')
    plt.plot(np.linspace(0, 100), h['kde'](np.linspace(0, 100)), label='kde')
    plt.legend()
plt.show()
Which would result in plots like the following (one for each histogram):
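As a quick sanity check (assuming the loop above has already filled in h['density'] and h['kde']), you can integrate both estimates over [0, 100]: the step density integrates to one by construction, while the KDE typically leaves a small amount of mass outside the interval, which is exactly the caveat raised in the question:

for h in hists:
    edges = np.concatenate([[0], (h['loc'][1:] + h['loc'][:-1]) / 2, [100]])
    step_mass = np.sum(h['density'] * np.diff(edges))
    kde_mass = h['kde'].integrate_box_1d(0, 100)
    print(step_mass, kde_mass)   # step_mass is 1.0; kde_mass is slightly below 1.0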
Here is one possibility. I'm not that crazy about it, but it kind of works. Note that the histograms are kind of weird, as the bin width is quite variable.
import numpy as np
from matplotlib import pyplot
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
def generate_random_histogram():
    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()
    # Random counts on those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count': bin_counts}

def bin_widths(loc):
    # Width of each bin: distance to the next bin center,
    # with the last width repeated for the final bin
    widths = []
    for i in range(len(loc) - 1):
        widths.append(loc[i + 1] - loc[i])
    widths.append(widths[-1])
    return np.array(widths)

def sample_from_hist(loc, count, size):
    # Pick a bin proportionally to its count, then draw a point
    # uniformly within +/- one bin width around that bin's center
    tot = np.sum(count)
    widths = bin_widths(loc)
    lowers = loc - widths
    uppers = loc + widths
    probs = count / float(tot)
    bins = np.argmax(np.random.multinomial(n=1, pvals=probs, size=(size,)), 1)
    return np.random.uniform(lowers[bins], uppers[bins])
# Generate the histogram
hist = generate_random_histogram()

# Sample from the histogram
sample = sample_from_hist(hist['loc'], hist['count'], np.sum(hist['count']))

# Fit a GMM, choosing the number of components by cross-validated log-likelihood
param_grid = [{'n_components': [1, 2, 3, 4, 5]}]
def scorer(est, X, y=None):
    # Total log-likelihood of the held-out data under the fitted mixture
    return np.sum(est.score_samples(X))
grid_search = GridSearchCV(GaussianMixture(), param_grid, scoring=scorer).fit(sample.reshape(-1, 1))
gmm = grid_search.best_estimator_

# Sample from the GMM
gmm_sample, _ = gmm.sample(np.sum(hist['count']))

# Plot the original histogram, the GMM sample, and the mixture density
fig = pyplot.figure()
ax1 = fig.add_subplot(311)
widths = bin_widths(hist['loc'])
ax1.bar(hist['loc'], hist['count'], width=widths)
ax2 = fig.add_subplot(312, sharex=ax1)
ax2.hist(gmm_sample.ravel(), bins=hist['loc'])
x = np.arange(min(sample), max(sample), .1)
y = np.exp(gmm.score_samples(x.reshape(-1, 1)))
ax3 = fig.add_subplot(313, sharex=ax1)
ax3.plot(x, y)
pyplot.show()
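If you would rather not wrap the mixture in a grid search, a common alternative is to pick the number of components with an information criterion. This is only a sketch, reusing the sample array from the code above and the same 1-to-5 component range; note it scores the full sample rather than held-out folds, so it is not a drop-in replacement for the cross-validated search:

from sklearn.mixture import GaussianMixture

X = sample.reshape(-1, 1)
candidates = [GaussianMixture(n_components=k).fit(X) for k in range(1, 6)]
gmm_bic = min(candidates, key=lambda m: m.bic(X))   # lowest BIC wins
print(gmm_bic.n_components)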