
Rolling and cumulative standard deviation in a Python dataframe

Is there a vectorized operation to calculate the cumulative and rolling standard deviation (SD) of a pandas DataFrame?

For example, I want to add a column 'c' that holds the cumulative SD of column 'a': at index 0 it shows NaN because there is only one data point, at index 1 it is the SD of the first two data points, and so on.

The same question applies to rolling SD. Is there an efficient way to calculate it without iterating through df.itertuples()?

import numpy as np
import pandas as pd

def main():
    np.random.seed(123)
    df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
    print(df)

if __name__ == '__main__':
    main()
Roy asked Jul 03 '17 07:07



2 Answers

For cumulative SD based on column 'a', use rolling with a window size equal to the length of the dataframe and min_periods=2:

df['c'] = df['a'].rolling(len(df), min_periods=2).std()

Output:

          a         b         c
0 -1.085631  0.997345       NaN
1  0.282978 -1.506295  0.967753
2 -0.578600  1.651437  0.691916
3 -2.426679 -0.428913  1.133892
4  1.265936 -0.866740  1.395750
5 -0.678886 -0.094709  1.250335
6  1.491390 -0.638902  1.374933
7 -0.443982 -0.434351  1.274843
8  2.205930  2.186786  1.450563
9  1.004054  0.386186  1.403721
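
As a sanity check, the rolling call above can be compared with a plain loop. This is a minimal sketch, assuming the same seed and data as in the question; the names vectorized and looped are illustrative only. It relies on pandas using the sample SD (ddof=1) by default.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Vectorized cumulative SD: window spans the whole frame, at least 2 points required.
vectorized = df['a'].rolling(len(df), min_periods=2).std()

# Naive reference: sample SD (ddof=1) of the first i+1 values, NaN for a single point.
values = df['a'].to_numpy()
looped = pd.Series(
    [np.nan if i == 0 else np.std(values[:i + 1], ddof=1) for i in range(len(values))],
    index=df.index,
)

# True when both approaches agree (NaN positions included).
print(np.allclose(vectorized, looped, equal_nan=True))
```

The loop is only there for comparison; the rolling call does the same work without iterating in Python.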

And for rolling SD based on two values at a time:

df['c'] = df['a'].rolling(2).std()

Output:

          a         b         c
0 -1.085631  0.997345       NaN
1  0.282978 -1.506295  0.967753
2 -0.578600  1.651437  0.609228
3 -2.426679 -0.428913  1.306789
4  1.265936 -0.866740  2.611073
5 -0.678886 -0.094709  1.375197
6  1.491390 -0.638902  1.534617
7 -0.443982 -0.434351  1.368514
8  2.205930  2.186786  1.873771
9  1.004054  0.386186  0.849855
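
The same rolling call also works on the whole DataFrame at once, which may be closer to what the question asks. A minimal sketch, assuming the question's data; rolled_all and rolled_pop are illustrative names, and ddof=0 is shown only as an option in case the population SD is wanted instead of the default sample SD.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Rolling sample SD (ddof=1, the default) over a window of 2, for every column.
rolled_all = df.rolling(2).std()

# Population SD (ddof=0) over the same window, column 'a' only.
rolled_pop = df['a'].rolling(2).std(ddof=0)

print(rolled_all)
```

For a window of two points the sample SD is |x1 - x2| / sqrt(2) and the population SD is |x1 - x2| / 2, so the two results differ by a constant factor of sqrt(2).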
Scott Boston answered Oct 08 '22 05:10



I think, if by rolling you mean cumulative, then the right term in Pandas is expanding:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html#pandas.DataFrame.expanding

It also accepts a min_periods argument.

df['c'] = df['a'].expanding(2).std()
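
A quick check that expanding gives the same numbers as the full-length rolling window from the other answer. This is a sketch on the question's data; expanding_sd and rolling_sd are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Cumulative SD via an expanding window.
expanding_sd = df['a'].expanding(min_periods=2).std()

# Cumulative SD via a rolling window as long as the frame.
rolling_sd = df['a'].rolling(len(df), min_periods=2).std()

# True when both agree (NaN positions included).
print(np.allclose(expanding_sd, rolling_sd, equal_nan=True))
```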

The case for rolling was handled by Scott Boston, and it is unsurprisingly called rolling in Pandas.

The advantage of expanding over rolling(len(df), ...) is that you don't need to know the length in advance. This is very useful, for example, with groupby, where each group has its own length.
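
A minimal sketch of the groupby case. The grouping column 'g' is hypothetical and added only for illustration; it is not part of the question's data.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
# Hypothetical grouping column, added only for illustration.
df['g'] = ['x', 'y'] * 5

# The expanding SD restarts within each group; no group length has to be supplied.
per_group = df.groupby('g')['a'].expanding(min_periods=2).std()

# The result is indexed by (group, original index); drop the group level
# so the assignment aligns with df's own index.
df['c'] = per_group.reset_index(level=0, drop=True)
print(df)
```

The first row of each group is NaN, because the expanding window starts over whenever the group changes.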

Tomasz Gandor answered Oct 08 '22 05:10
