Using pandas v1.0.1 and numpy 1.18.1, I want to calculate the rolling mean and std with different window sizes on a time series. In the data I am working with, the values can be constant over several consecutive points, so that, depending on the window size, the rolling mean equals every value in the window and the corresponding std is expected to be 0.
However, I see different behavior on the same df depending on the window size.
MWE:
import numpy as np
import pandas as pd

for window in [3, 5]:
    values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0]
    df = pd.DataFrame(values, columns=['values'])
    # roll over the 'values' column only, so each result is a single Series
    rolling = df['values'].rolling(window, min_periods=1)
    df.loc[:, 'mean'] = rolling.mean()
    df.loc[:, 'std'] = rolling.std(ddof=0)
    df.info()
    print(f'window: {window}')
    print(df)
    # population std of the last `window` values, computed directly by numpy
    print('non-rolling result:', df['values'].iloc[len(df.index)-window:].values.std())
    print('')
Output:
window: 3
values mean std
0 1234.0 1234.000000 0.000000
1 4567.0 2900.500000 1666.500000
2 6800.0 4200.333333 2287.053757
3 6810.0 6059.000000 1055.011216
4 6821.0 6810.333333 8.576454
5 6820.0 6817.000000 4.966555
6 6820.0 6820.333333 0.471405
7 6820.0 6820.000000 0.000000
8 6820.0 6820.000000 0.000000
9 6820.0 6820.000000 0.000000
10 6820.0 6820.000000 0.000000
non-rolling result: 0.0
window: 5
values mean std
0 1234.0 1234.000000 0.000000
1 4567.0 2900.500000 1666.500000
2 6800.0 4200.333333 2287.053757
3 6810.0 4852.750000 2280.329732
4 6821.0 5246.400000 2186.267193
5 6820.0 6363.600000 898.332366
6 6820.0 6814.200000 8.158431
7 6820.0 6818.200000 4.118252
8 6820.0 6820.200000 0.400000
9 6820.0 6820.000000 0.000021
10 6820.0 6820.000000 0.000021
non-rolling result: 0.0
As expected, the std is 0 for idx 7,8,9,10 using a window size of 3. For a window size of 5, I would expect idx 9 and 10 to yield 0. However, the result is different from 0.
If I calculate the std 'manually' for the last window of each window size (using idxs 8,9,10 and 6,7,8,9,10, respectively), I get the expected result of 0 for both cases.
Does anybody have an idea what could be the issue here? Any numerical caveats?
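One numerical caveat that can produce this kind of residue is an incrementally updated variance. The sketch below is not pandas' actual code; it assumes a sliding window that keeps a running mean and a running sum of squared deviations, adding the incoming value and removing the outgoing one (the function name `rolling_std_incremental` is made up for illustration). It compares that against recomputing each window from scratch with `np.std`:

```python
import numpy as np

values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0,
          6820.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0]


def rolling_std_incremental(xs, window):
    """Population std per window via running add/remove updates (illustrative only)."""
    n, mean, m2 = 0, 0.0, 0.0  # count, running mean, sum of squared deviations
    out = []
    for i, x in enumerate(xs):
        # add the value entering the window
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # remove the value leaving the window, if the window is full
        if i >= window:
            old = xs[i - window]
            n -= 1
            delta = old - mean
            mean -= delta / n
            m2 -= delta * (old - mean)
        # rounding can push m2 slightly below zero, so clamp before the sqrt
        out.append(np.sqrt(max(m2, 0.0) / n))
    return out


def rolling_std_direct(xs, window):
    """Population std per window, recomputed from scratch each time."""
    return [np.std(xs[max(0, i - window + 1): i + 1]) for i in range(len(xs))]


incremental = rolling_std_incremental(values, 5)
direct = rolling_std_direct(values, 5)
for i, (a, b) in enumerate(zip(incremental, direct)):
    print(i, a, b)
```

The two columns agree closely, but the incremental one carries rounding error from the large early deviations forward, so it may not land on exactly 0 for the constant windows, whereas the direct recomputation does.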
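It seems that the implementation of std() in pd.rolling favors performance over numerical accuracy: the result is apparently updated incrementally as the window slides rather than recomputed from scratch, so rounding error from the large early deviations can linger after those values have left the window. However, you can apply the np version of the standard deviation instead: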
df.loc[:, 'std'] = df['values'].rolling(window, min_periods=1).apply(np.std, raw=True)
Result:
values std
0 1234.0 0.000000
1 4567.0 1666.500000
2 6800.0 2287.053757
3 6810.0 2280.329732
4 6821.0 2186.267193
5 6820.0 898.332366
6 6820.0 8.158431
7 6820.0 4.118252
8 6820.0 0.400000
9 6820.0 0.000000
10 6820.0 0.000000
Now the std at indices 9 and 10 is exactly 0, as expected.
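Note that apply(np.std) calls a Python-level function for every window, which can be slow on long series. If that matters, one possible alternative (a sketch, not something built into pandas; the names `is_constant` and `std_fixed` are made up here) is to keep the fast built-in std() and overwrite it with 0 wherever the window is constant, which is the case exactly when the rolling max equals the rolling min:

```python
import pandas as pd

values = [1234.0, 4567.0, 6800.0, 6810.0, 6821.0,
          6820.0, 6820.0, 6820.0, 6820.0, 6820.0, 6820.0]
s = pd.Series(values, name='values')

window = 5
roll = s.rolling(window, min_periods=1)

# max and min involve no arithmetic, so this comparison is exact
is_constant = roll.max() == roll.min()

# keep the built-in rolling std, but force 0.0 on constant windows
std_fixed = roll.std(ddof=0).mask(is_constant, 0.0)

print(pd.DataFrame({'values': s, 'std': std_fixed}))
```

This only repairs the constant-window case; windows that are nearly constant still carry whatever rounding error the built-in std() produces.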