Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to have pandas perform a rolling average on a non-uniform x-grid

I would like to perform a rolling average but with a window that only has a finite 'vision' in x. I would like something similar to what I have below, but I want a window range that based on the x value rather than position index.

While doing this within pandas is preferred numpy/scipy equivalents are also OK

import numpy as np 
import pandas as pd 

x_val = [1,2,4,8,16,32,64,128,256,512]
y_val = [x+np.random.random()*200 for x in x_val]

df = pd.DataFrame(data={'x':x_val,'y':y_val})
df.set_index('x', inplace=True)

df.plot()
df.rolling(1, win_type='gaussian').mean(std=2).plot()

So I would expect the first 5 values to be averaged together because they are within 10 xunits of each other, but the last values to be unchanged.

like image 879
Daniel Marchand Avatar asked Nov 18 '20 16:11

Daniel Marchand


People also ask

What is Min_periods in rolling?

min_periods : Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1.

How does pandas calculate average of columns?

To calculate the mean of whole columns in the DataFrame, use pandas. Series. mean() with a list of DataFrame columns. You can also get the mean for all numeric columns using DataFrame.


2 Answers

According to pandas documentation on rolling

Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.

Therefore, maybe you need to fake a rolling operation with various window sizes like this

test_df = pd.DataFrame({'x':np.linspace(1,10,10),'y':np.linspace(1,10,10)})
test_df['win_locs'] = np.linspace(1,10,10).astype('object')
for ind in range(10): test_df.at[ind,'win_locs'] = np.random.randint(0,10,np.random.randint(5)).tolist()

    
# rolling operation with various window sizes
def worker(idx_list):
    
    x_slice = test_df.loc[idx_list,'x']
    return np.sum(x_slice)

test_df['rolling'] = test_df['win_locs'].apply(worker)

As you can see, test_df is

      x     y      win_locs  rolling
0   1.0   1.0        [5, 2]      9.0
1   2.0   2.0  [4, 8, 7, 1]     24.0
2   3.0   3.0            []      0.0
3   4.0   4.0           [9]     10.0
4   5.0   5.0     [6, 2, 9]     20.0
5   6.0   6.0            []      0.0
6   7.0   7.0     [5, 7, 9]     24.0
7   8.0   8.0            []      0.0
8   9.0   9.0            []      0.0
9  10.0  10.0  [9, 4, 7, 1]     25.0

where the rolling operation is achieved with apply method.

However, this approach is significantly slower than the native rolling, for example,

test_df = pd.DataFrame({'x':np.linspace(1,10,10),'y':np.linspace(1,10,10)})
test_df['win_locs'] = np.linspace(1,10,10).astype('object')
for ind in range(10): test_df.at[ind,'win_locs'] = np.arange(ind-1,ind+1).tolist() if ind >= 1 else []

using the approach above

%%timeit
# rolling operation with various window sizes
def worker(idx_list):
    
    x_slice = test_df.loc[idx_list,'x']
    return np.sum(x_slice)

test_df['rolling_apply'] = test_df['win_locs'].apply(worker)

the result is

41.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

while using native rolling is ~x50 faster

%%timeit
test_df['rolling_native'] = test_df['x'].rolling(window=2).sum()

863 µs ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
like image 197
meTchaikovsky Avatar answered Nov 15 '22 00:11

meTchaikovsky


The key question remains: what do you want to achieve with the rolling mean ?

Mathematically a clean way is:

  1. interpolate to the finest dx of the x-data
  2. perform the rolling mean
  3. take out the data points you want (But be careful: this step is a type of averaging too!)

Here is the code for the interpolation:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

x_val = [1,2,4,8,16,32,64,128,256,512]
y_val = [x+np.random.random()*200 for x in x_val]

df = pd.DataFrame(data={'x':x_val,'y':y_val})
df.set_index('x', inplace=True)

#df.plot()
df.rolling(5, win_type='gaussian').mean(std=200).plot()


#---- Interpolation -----------------------------------
f1 = interp1d(x_val, y_val)
f2 = interp1d(x_val, y_val, kind='cubic')

dx = np.diff(x_val).min()  # get the smallest dx in the x-data set

xnew = np.arange(x_val[0], x_val[-1]+dx, step=dx)
ynew1 = f1(xnew)
ynew2 = f2(xnew)

#---- plot ---------------------------------------------
fig = plt.figure(figsize=(15,5))
plt.plot(x_val, y_val, '-o', label='data', alpha=0.5)
plt.plot(xnew, ynew1, '|', ms = 15, c='r', label='linear', zorder=1)
#plt.plot(xnew, ynew2, label='cubic')
plt.savefig('curve.png')
plt.legend(loc='best')
plt.show()

enter image description here

like image 39
pyano Avatar answered Nov 15 '22 01:11

pyano