Let's say I have a data set and used matplotlib to draw a histogram of said data set. <pre class="prettyprint"><code>n, bins, patches = plt.hist(data, normed=1) </code></pre> How do I calculate the standard deviation, using the <code>n</code> and <code>bins</code> values that <code>hist()</code> returns? I'm currently doing this to calculate the mean: <pre class="prettyprint"><code>s = 0 for i in range(len(n)): s += n[i] * ((bins[i] + bins[i+1]) / 2) mean = s / numpy.sum(n) </code></pre> which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this: <pre class="prettyprint"><code>t = 0 for i in range(len(n)): t += (bins[i] - mean)**2 std = np.sqrt(t / numpy.sum(n)) </code></pre> my results are way off from what <code>numpy.std(data)</code> returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the <code>n</code> and <code>bins</code> values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.

You haven't weighted the contribution of each bin with <code>n[i]</code>. Change the increment of <code>t</code> to <pre class="prettyprint"><code> t += n[i]*(bins[i] - mean)**2 </code></pre> By the way, you can simplify (and speed up) your calculation by using <code>numpy.average</code> with the <code>weights</code> argument. Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram. <pre class="prettyprint"><code>In [54]: x = np.random.normal(loc=10, scale=2, size=1000) In [55]: x.mean() Out[55]: 9.9760798903061847 In [56]: x.var() Out[56]: 3.7673459904902025 In [57]: x.std() Out[57]: 1.9409652213499866 </code></pre> I'll use <code>numpy.histogram</code> to compute the histogram: <pre class="prettyprint"><code>In [58]: n, bins = np.histogram(x) </code></pre> <code>mids</code> is the midpoints of the bins; it has the same length as <code>n</code>: <pre class="prettyprint"><code>In [59]: mids = 0.5*(bins[1:] + bins[:-1]) </code></pre> The estimate of the mean is the weighted average of <code>mids</code>: <pre class="prettyprint"><code>In [60]: mean = np.average(mids, weights=n) In [61]: mean Out[61]: 9.9763028267760312 </code></pre> In this case, it is pretty close to the mean of the original data. The estimated variance is the weighted average of the squared difference from the mean: <pre class="prettyprint"><code>In [62]: var = np.average((mids - mean)**2, weights=n) In [63]: var Out[63]: 3.8715035807387328 In [64]: np.sqrt(var) Out[64]: 1.9676136767004677 </code></pre> That estimate is within 2% of the actual sample standard deviation.

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Let's say I have a data set and used matplotlib to draw a histogram of said data set.

n, bins, patches = plt.hist(data, normed=1)

How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:

s = 0
for i in range(len(n)):
   s += n[i] * ((bins[i] + bins[i+1]) / 2) 
mean = s / numpy.sum(n)

which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:

t = 0
for i in range(len(n)):
  t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))

my results are way off from what numpy.std(data) returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.

How do you find the standard deviation in Python?

Coding a stdev() Function in Python sqrt() to take the square root of the variance. With this new implementation, we can use ddof=0 to calculate the standard deviation of a population, or we can use ddof=1 to estimate the standard deviation of a population using a sample of data.

How do you extract data from a histogram in Python?

To get the values from a histogram, plt. hist returns them, so all you have to do is save them.

You haven't weighted the contribution of each bin with n[i]. Change the increment of t to

    t += n[i]*(bins[i] - mean)**2

By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.

Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.

In [54]: x = np.random.normal(loc=10, scale=2, size=1000)

In [55]: x.mean()
Out[55]: 9.9760798903061847

In [56]: x.var()
Out[56]: 3.7673459904902025

In [57]: x.std()
Out[57]: 1.9409652213499866

I'll use numpy.histogram to compute the histogram:

In [58]: n, bins = np.histogram(x)

mids is the midpoints of the bins; it has the same length as n:

In [59]: mids = 0.5*(bins[1:] + bins[:-1])

The estimate of the mean is the weighted average of mids:

In [60]: mean = np.average(mids, weights=n)

In [61]: mean
Out[61]: 9.9763028267760312

In this case, it is pretty close to the mean of the original data.

The estimated variance is the weighted average of the squared difference from the mean:

In [62]: var = np.average((mids - mean)**2, weights=n)

In [63]: var
Out[63]: 3.8715035807387328

In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677

That estimate is within 2% of the actual sample standard deviation.

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Tags:

python-3.x

matplotlib

numpy

justaname

People also ask

1 Answers

Warren Weckesser

Recent Activity

Donate For Us

How to calculate the standard deviation from a histogram? (Python, Matplotlib)

Tags:

python-3.x

matplotlib

numpy

justaname

People also ask

1 Answers

Warren Weckesser

Related questions

Recent Activity

Donate For Us