 

How to efficiently calculate a running standard deviation?

I have an array of lists of numbers, e.g.:

[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)

What I would like to do is efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).

To do the standard deviation, I loop through again, now that I have the mean calculated.

I would like to avoid going through the array twice, once for the mean and then once for the SD (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g. Perl or Python) or pseudocode is fine.

asked Jul 23 '09 by Alex Reynolds


2 Answers

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

  • Wikipedia: Algorithms for calculating variance

It's more numerically stable than either the naive two-pass method or the single-pass sum-of-squares accumulator suggested in other answers. The stability only really matters when many of the values are close to each other, since that situation leads to what is known as "catastrophic cancellation" in the floating-point literature.
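A minimal sketch of Welford's algorithm in Python (the question invites Python; the function and variable names here are illustrative, not from the Wikipedia article):

```python
import math

def welford(values):
    """Return (mean, population standard deviation) in a single pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    return mean, math.sqrt(m2 / n)  # use m2 / (n - 1) for the sample SD
```

Because `m2` accumulates squared deviations from the running mean rather than raw squares, the large-number cancellation described above never occurs.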

You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
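The divisor difference can be illustrated with a small helper (a sketch; the `ddof` name is borrowed from NumPy's convention and is not part of the answer):

```python
def variance(values, ddof=0):
    """ddof=0 gives the population variance; ddof=1 the sample variance."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    return ss / (n - ddof)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# variance(data)    -> divides by N   (population)
# variance(data, 1) -> divides by N-1 (unbiased sample estimate)
```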

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:

  • Computing Sample Mean and Variance Online in One Pass
  • Deleting Values in Welford’s Algorithm for Online Mean and Variance

You can also take a look at my Java implementation; the javadoc, source, and unit tests are all online:

  • Javadoc: stats.OnlineNormalEstimator
  • Source: stats.OnlineNormalEstimator.java
  • JUnit Source: test.unit.stats.OnlineNormalEstimatorTest.java
  • LingPipe Home Page
answered Oct 02 '22 by Bob Carpenter


The basic answer is to accumulate both the sum of x (call it 'sum_x1') and the sum of x² (call it 'sum_x2') as you go. The standard deviation is then:

stdev = sqrt((sum_x2 / n) - (mean * mean))

where

mean = sum_x1 / n

As written (dividing by n), this is the population standard deviation; for the sample standard deviation, divide the sum of squared deviations by 'n - 1' instead: stdev = sqrt((sum_x2 - n * mean * mean) / (n - 1)).

You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.
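A sketch of this accumulator approach in Python, using the population formula (names are illustrative; as noted above, the subtraction can lose precision for large samples):

```python
import math

def running_stats(values):
    """Return (mean, population standard deviation) in a single pass,
    using the sum-of-squares method; susceptible to cancellation."""
    n = 0
    sum_x1 = 0.0  # sum of x
    sum_x2 = 0.0  # sum of x squared
    for x in values:
        n += 1
        sum_x1 += x
        sum_x2 += x * x
    mean = sum_x1 / n
    return mean, math.sqrt((sum_x2 / n) - (mean * mean))
```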

answered Oct 02 '22 by Jonathan Leffler