
Computing Standard Deviation in a stream

Tags:

python

math

Using Python, assume I'm running through a known quantity of items I, and have the ability to time how long it takes to process each one, t, as well as a running total of time spent processing, T, and the number of items processed so far, c. I'm currently calculating the average on the fly as A = T / c, but this can be skewed by, say, a single item taking an extraordinarily long time to process (a few seconds compared to a few milliseconds).

I would like to show a running Standard Deviation. How can I do this without keeping a record of each t?
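
For reference, a minimal sketch of what I'm doing now (items and process_item are placeholders, not my real code):

import time

T = 0.0   # running total of processing time
c = 0     # number of items processed so far

for item in items:             # items: the known quantity of work, length I
    start = time.time()
    process_item(item)         # placeholder for the actual per-item work
    t = time.time() - start    # time taken for this item
    T += t
    c += 1
    A = T / c                  # running average, easily skewed by one slow item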

Josh K asked Apr 04 '11



2 Answers

As outlined in the Wikipedia article on the standard deviation, it is enough to keep track of the following three sums:

s0 = sum(1 for x in samples)
s1 = sum(x for x in samples)
s2 = sum(x*x for x in samples)

These sums are easily updated as new values arrive. The standard deviation can be calculated as

import math

std_dev = math.sqrt((s0 * s2 - s1 * s1) / (s0 * (s0 - 1)))
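
For instance, here is a minimal sketch of how the three sums might be maintained incrementally as each new timing arrives (the RunningStats name and its methods are illustrative, not part of the original answer):

import math

class RunningStats:
    """Keep s0 (count), s1 (sum) and s2 (sum of squares), and derive mean/std."""

    def __init__(self):
        self.s0 = 0      # number of samples seen
        self.s1 = 0.0    # running sum of samples
        self.s2 = 0.0    # running sum of squared samples

    def add(self, x):
        self.s0 += 1
        self.s1 += x
        self.s2 += x * x

    @property
    def mean(self):
        return self.s1 / self.s0

    @property
    def std_dev(self):
        # sample standard deviation; needs at least two samples
        return math.sqrt((self.s0 * self.s2 - self.s1 * self.s1) /
                         (self.s0 * (self.s0 - 1)))

Each update is O(1) and only the three sums are stored, which satisfies the constraint of not keeping every t.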

Note that this way of computing the standard deviation can be numerically ill-conditioned if your samples are floating point numbers and the standard deviation is small compared to the mean of the samples. If you expect samples of this type, you should resort to Welford's method (see the accepted answer).
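
To make the ill-conditioning concrete, here is a rough illustration (the sample values are arbitrary, chosen only so the mean dwarfs the spread):

import statistics

# timings with a large mean and a tiny spread
samples = [1e8 + d for d in (0.1, 0.2, 0.3, 0.4, 0.5)]

s0 = len(samples)
s1 = sum(samples)
s2 = sum(x * x for x in samples)

naive_var = (s0 * s2 - s1 * s1) / (s0 * (s0 - 1))
print(naive_var)                     # dominated by rounding error, may even be negative
print(statistics.variance(samples))  # ~0.025, the correct sample variance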

Sven Marnach answered Oct 04 '22


Based on Welford's algorithm:

import numpy as np


class OnlineVariance(object):
    """
    Welford's algorithm computes the sample variance incrementally.
    """

    def __init__(self, iterable=None, ddof=1):
        self.ddof, self.n, self.mean, self.M2 = ddof, 0, 0.0, 0.0
        if iterable is not None:
            for datum in iterable:
                self.include(datum)

    def include(self, datum):
        self.n += 1
        self.delta = datum - self.mean
        self.mean += self.delta / self.n
        self.M2 += self.delta * (datum - self.mean)

    @property
    def variance(self):
        return self.M2 / (self.n - self.ddof)

    @property
    def std(self):
        return np.sqrt(self.variance)

Update the variance with each new piece of data:

N = 100
data = np.random.random(N)
ov = OnlineVariance(ddof=0)
for d in data:
    ov.include(d)
std = ov.std
print(std)

Check our result against the standard deviation computed by numpy:

assert np.allclose(std, data.std()) 
unutbu answered Oct 03 '22