Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Time-based .rolling() fails with group by

Here's a code snippet from Pandas Issue #13966

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

Fails:

df.groupby('A').rolling('4s', on='B').C.mean()
ValueError: B must be monotonic

Per the issue linked above, this seems to be a bug. Does anyone have a good workaround?

like image 546
Huey Avatar asked Nov 26 '16 18:11

Huey


1 Answers

Set B as the index first so as to use Groupby.resample method on it.

df.set_index('B', inplace=True)

Groupby A and resample based on seconds frequency. As resample cannot be directly used with rolling, use ffill(forward fillna with NaN limit as 0). Now use rolling function by specifying the window size as 4 (because of freq=4s) interval and take it's mean along C column as shown:

for _, grp in df.groupby('A'):
    print (grp.resample('s').ffill(limit=0).rolling(4)['C'].mean().head(10)) #Remove head() 

Resulting output obtained:

B
2016-01-01 09:30:00    NaN
2016-01-01 09:30:01    NaN
2016-01-01 09:30:02    NaN
2016-01-01 09:30:03    1.5
2016-01-01 09:30:04    2.5
2016-01-01 09:30:05    3.5
2016-01-01 09:30:06    4.5
2016-01-01 09:30:07    5.5
2016-01-01 09:30:08    6.5
2016-01-01 09:30:09    7.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:00     NaN
2016-01-01 09:30:01     NaN
2016-01-01 09:30:02     NaN
2016-01-01 09:30:03    21.5
2016-01-01 09:30:04    22.5
2016-01-01 09:30:05    23.5
2016-01-01 09:30:06    24.5
2016-01-01 09:30:07    25.5
2016-01-01 09:30:08    26.5
2016-01-01 09:30:09    27.5
Freq: S, Name: C, dtype: float64
B
2016-01-01 09:30:12     NaN
2016-01-01 09:30:13     NaN
2016-01-01 09:30:14     NaN
2016-01-01 09:30:15    33.5
2016-01-01 09:30:16    34.5
2016-01-01 09:30:17    35.5
2016-01-01 09:30:18    36.5
2016-01-01 09:30:19    37.5
Freq: S, Name: C, dtype: float64

TL;DR

Use groupby.apply as a workaround instead after setting the index appropriately:

# tested in version - 0.19.1
df.groupby('A').apply(lambda grp: grp.resample('s').ffill(limit=0).rolling(4)['C'].mean())

(Or)

# Tested in OP's version - 0.19.0
df.groupby('A').apply(lambda grp: grp.resample('s').ffill().rolling(4)['C'].mean())

Both work.

like image 161
Nickil Maveli Avatar answered Nov 04 '22 05:11

Nickil Maveli