
Can't get y-axis on Matplotlib histogram to display probabilities

I have data (pd Series) that looks like (daily stock returns, n = 555):

S = perf_manual.returns
S = S[~((S-S.mean()).abs()>3*S.std())]

2014-03-31 20:00:00    0.000000
2014-04-01 20:00:00    0.000000
2014-04-03 20:00:00   -0.001950
2014-04-04 20:00:00   -0.000538
2014-04-07 20:00:00    0.000764
2014-04-08 20:00:00    0.000803
2014-04-09 20:00:00    0.001961
2014-04-10 20:00:00    0.040530
2014-04-11 20:00:00   -0.032319
2014-04-14 20:00:00   -0.008512
2014-04-15 20:00:00   -0.034109
...

I'd like to generate a probability distribution plot from this. Using:

print stats.normaltest(S)

n, bins, patches = plt.hist(S, 100, normed=1, facecolor='blue', alpha=0.75)
print np.sum(n * np.diff(bins))

(mu, sigma) = stats.norm.fit(S)
print mu, sigma
y = mlab.normpdf(bins, mu, sigma)
plt.grid(True)
l = plt.plot(bins, y, 'r', linewidth=2)

plt.xlim(-0.05,0.05)
plt.show()

I get the following:

NormaltestResult(statistic=66.587382579416982, pvalue=3.473230376732532e-15)
1.0
0.000495624926242 0.0118790391467

graph

I have the impression the y-axis is a count, but I'd like to have probabilities instead. How do I do that? I've tried a whole lot of StackOverflow answers and can't figure this out.

asked Dec 08 '22 21:12 by Joël


1 Answer

There is no easy way (that I know of) to do that using plt.hist. But you can simply bin the data using np.histogram and then normalize it any way you want. If I understood you correctly, you want the plot to show the probability of finding a point in a given bin, NOT the probability density. That means you have to scale the counts so that the sum over all bins is 1, which can be done with bin_probability = n/float(n.sum()).
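As a minimal sketch of just that normalization step (the names sample and counts and the random seed are illustrative choices, not taken from the question), the counts from np.histogram can be divided by their total and checked to sum to 1:

```python
import numpy as np

# Illustrative test data; the seed and parameters are arbitrary assumptions.
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 0.01, size=1000)

# Raw counts per bin and the 101 edges that delimit the 100 bins.
counts, bin_edges = np.histogram(sample, bins=100)

# Divide by the total count so each value is the fraction of points in that bin.
bin_probability = counts / counts.sum()

print(bin_probability.sum())
```

Each entry of bin_probability is then the fraction of points that fell into the corresponding bin, which is what the y-axis should show.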

You will then no longer have a properly normalized probability density function (pdf), meaning that the integral over an interval will not be a probability! That is why you have to rescale the output of mlab.normpdf to have the same normalization as your histogram. The factor needed is just the bin width: for the properly normalized binned pdf, the sum over all bins of height times width is 1, whereas now you want the sum of the bin heights alone to equal 1. So the scaling factor is the bin width.
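The bin-width argument can be checked numerically. The sketch below is only an illustration under assumed test data: it compares the density-normalized histogram (np.histogram with density=True) against the per-bin probabilities, and uses scipy.stats.norm.pdf as a stand-in for mlab.normpdf, which is not available in recent Matplotlib releases.

```python
import numpy as np
from scipy import stats

# Illustrative test data; the seed and parameters are arbitrary assumptions.
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 0.01, size=1000)

counts, bin_edges = np.histogram(sample, bins=100)
bin_width = np.diff(bin_edges)                 # width of every bin
bin_middles = 0.5 * (bin_edges[1:] + bin_edges[:-1])

# Density normalization: sum(height * width) == 1.
density, _ = np.histogram(sample, bins=bin_edges, density=True)
# Probability normalization: sum(height) == 1.
bin_probability = counts / counts.sum()

# The two differ exactly by the bin width.
print(np.allclose(density * bin_width, bin_probability))

# Scale the fitted normal pdf by the same factor so it matches the bar heights.
mu, sigma = stats.norm.fit(sample)
scaled_pdf = stats.norm.pdf(bin_middles, mu, sigma) * bin_width
print(scaled_pdf.sum())
```

The sum of scaled_pdf approximates the integral of the fitted pdf over the range covered by the data, so it should come out close to, but not exactly, 1.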

Therefore, the code you end up with is something along the lines of:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Produce test data
S = np.random.normal(0, 0.01, size=1000)

# Histogram:
# Bin it
n, bin_edges = np.histogram(S, 100)
# Normalize it, so that every bin's value gives the probability of that bin
bin_probability = n/float(n.sum())
# Get the mid points of every bin
bin_middles = (bin_edges[1:]+bin_edges[:-1])/2.
# Compute the bin width
bin_width = bin_edges[1]-bin_edges[0]
# Plot the histogram as a bar plot
plt.bar(bin_middles, bin_probability, width=bin_width)

# Fit to normal distribution
(mu, sigma) = stats.norm.fit(S)
# The pdf should no longer be normalized but scaled the same way as the data.
# stats.norm.pdf is used here because mlab.normpdf has been removed from
# recent Matplotlib releases.
y = stats.norm.pdf(bin_middles, mu, sigma)*bin_width
l = plt.plot(bin_middles, y, 'r', linewidth=2)

plt.grid(True)
plt.xlim(-0.05,0.05)
plt.show()

And the resulting picture will be:

[image: bar plot of the bin probabilities with the rescaled normal fit overlaid in red]
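A possible alternative, which is not part of the answer above, is to pass per-sample weights to plt.hist so that the bar heights sum to 1 directly. This is a hedged sketch: the test data, the seed, and the variable names are assumptions, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Illustrative test data; the seed and parameters are arbitrary assumptions.
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 0.01, size=1000)

# Give every point a weight of 1/N so each bar shows the fraction of points in its bin.
weights = np.ones_like(sample) / sample.size
heights, bin_edges, patches = plt.hist(sample, bins=100, weights=weights,
                                       facecolor='blue', alpha=0.75)

plt.grid(True)
plt.xlim(-0.05, 0.05)
print(heights.sum())
```

A fitted normal curve drawn on top of this plot would still need to be multiplied by the bin width, for the same reason given in the answer.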

answered Dec 10 '22 09:12 by jotasi