Using pandas v1.0.1 and numpy 1.18.1, I want to calculate the rolling mean and std of a time series with different window sizes. In my data, the values can be constant over several consecutive points, so that, depending on the window size, all values in a window are equal to the rolling mean and the corresponding std is expected to be 0.
However, with the same df I see different behavior depending on the window size.
MWE:
import numpy as np
import pandas as pd
for window in [3, 5]:
    values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0]
    df = pd.DataFrame(values, columns=['values'])
    # Select the column explicitly so the rolling stats are computed on 'values' only
    df['mean'] = df['values'].rolling(window, min_periods=1).mean()
    df['std'] = df['values'].rolling(window, min_periods=1).std(ddof=0)
    print(f'window: {window}')
    print(df)
    # Same statistic computed directly on the last `window` values
    print('non-rolling result:', df['values'].iloc[len(df.index)-window:].values.std())
    print('')
Output:
window: 3
    values         mean          std
0   1234.0  1234.000000     0.000000
1   4567.0  2900.500000  1666.500000
2   6800.0  4200.333333  2287.053757
3   6810.0  6059.000000  1055.011216
4   6821.0  6810.333333     8.576454
5   6820.0  6817.000000     4.966555
6   6820.0  6820.333333     0.471405
7   6820.0  6820.000000     0.000000
8   6820.0  6820.000000     0.000000
9   6820.0  6820.000000     0.000000
10  6820.0  6820.000000     0.000000
non-rolling result: 0.0
window: 5
    values         mean          std
0   1234.0  1234.000000     0.000000
1   4567.0  2900.500000  1666.500000
2   6800.0  4200.333333  2287.053757
3   6810.0  4852.750000  2280.329732
4   6821.0  5246.400000  2186.267193
5   6820.0  6363.600000   898.332366
6   6820.0  6814.200000     8.158431
7   6820.0  6818.200000     4.118252
8   6820.0  6820.200000     0.400000
9   6820.0  6820.000000     0.000021
10  6820.0  6820.000000     0.000021
non-rolling result: 0.0
As expected, the std is 0 for idx 7, 8, 9 and 10 with a window size of 3. With a window size of 5, I would expect idx 9 and 10 to yield 0 as well. However, the result there is 0.000021 instead of 0.
If I calculate the std 'manually' for the last window of each window size (using idxs 8,9,10 and 6,7,8,9,10, respectively), I get the expected result of 0 for both cases.
Does anybody have an idea what could be the issue here? Any numerical caveats?
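It seems that the implementation of std() in pandas' rolling favors performance over numerical accuracy. As far as I can tell, the result is updated incrementally as the window slides instead of being recomputed from scratch for every window, so rounding error left over from the earlier, much larger deviations can linger. However, you can apply the numpy version of the standard deviation instead: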
df['std'] = df['values'].rolling(window, min_periods=1).apply(np.std, raw=True)
Result (window size 5):
    values          std
0   1234.0     0.000000
1   4567.0  1666.500000
2   6800.0  2287.053757
3   6810.0  2280.329732
4   6821.0  2186.267193
5   6820.0   898.332366
6   6820.0     8.158431
7   6820.0     4.118252
8   6820.0     0.400000
9   6820.0     0.000000
10  6820.0     0.000000
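Now the std at idx 9 and 10 is exactly 0, as expected. The price is speed: np.std is called once for every window, which is noticeably slower than the built-in rolling std on long series.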
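To illustrate where such a residual can come from, here is a minimal sketch that compares an incrementally updated sliding-window std (Welford-style add/remove steps) with a std recomputed from scratch per window. This is not pandas' actual code, just an assumption about the general technique; the function names `rolling_std_incremental` and `rolling_std_recomputed` are made up for this example.

```python
import math

import numpy as np

values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0,
          6820.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0]


def rolling_std_incremental(xs, window):
    """Population std (ddof=0) per window, updated as the window slides.

    Hypothetical helper: keeps a running mean and a running sum of squared
    deviations (m2) and never recomputes a window from scratch.
    """
    out = []
    n, mean, m2 = 0, 0.0, 0.0
    for i, x in enumerate(xs):
        # Add the value entering the window (Welford update).
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if i >= window:
            # Remove the value leaving the window (reverse update).
            old = xs[i - window]
            n -= 1
            delta = old - mean
            mean -= delta / n
            m2 -= delta * (old - mean)
        # Rounding can push m2 slightly below zero, so clamp before the sqrt.
        out.append(math.sqrt(max(m2, 0.0) / n))
    return out


def rolling_std_recomputed(xs, window):
    """Population std (ddof=0) per window, recomputed with np.std each time."""
    return [float(np.std(xs[max(0, i - window + 1):i + 1]))
            for i in range(len(xs))]


incremental = rolling_std_incremental(values, 5)
recomputed = rolling_std_recomputed(values, 5)
for i, (a, b) in enumerate(zip(incremental, recomputed)):
    print(i, a, b)
```

The recomputed column ends in exact zeros because every value in the last windows is 6820.0. The incremental column may end in a tiny nonzero number instead, because m2 still carries rounding error from the steps where the deviations were in the thousands.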