
Conditional expanding group aggregation pandas

Tags: python, pandas

For some data preprocessing I have a huge dataframe where I need historical performance within groups. Because it feeds a predictive model that runs a week before the target, I cannot use any data from the week in between. The number of rows per day per group varies, so I cannot simply discard the last 7 values with a shift on the expanding functions; I have to condition on the datetime of the earlier rows. I could write my own function and apply it to the groups, but in my experience that is usually very slow (albeit flexible). This is how I did it without conditioning on the date, just looking at previous records:

df.loc[:, 'new_col'] = df_gr['old_col'].apply(lambda x: x.expanding(5).mean().shift(1))

The 5 is the minimum sample size: with fewer than 5 observations the result should be NaN.
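To make the effect of the 5 and the shift concrete, here is a minimal sketch on made-up data. The toy values and the use of group_keys=False (so the result keeps the original row index) are my assumptions, not part of the original snippet.

```python
import pandas as pd

# Hypothetical toy data: group A has six rows, group B only three.
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 3,
    "old_col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0, 30.0],
})

# group_keys=False keeps the original row index, so the result aligns with df.
df_gr = df.groupby("group", group_keys=False)

# Expanding mean needs at least 5 observations; shift(1) then excludes
# the current row, so each row only sees strictly earlier records.
df["new_col"] = df_gr["old_col"].apply(
    lambda x: x.expanding(min_periods=5).mean().shift(1)
)
print(df)
```

Only the sixth row of group A gets a value (the mean of its first five rows); every other row stays NaN because fewer than five earlier records exist.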

Small example, where aggr_mean is the mean of all samples within group A that are at least a week older:

group | dt       | value  | aggr_mean
A     | 01-01-16 | 5      | NaN
A     | 03-01-16 | 4      | NaN
A     | 08-01-16 | 12     | 5 (only looks at first row)
A     | 17-01-16 | 11     | 7 (looks at first three rows since all are at least a week earlier)
asked Jan 05 '23 22:01 by Jan van der Vegt


1 Answer

new answer
using @JulienMarrec's better example

dt           group  value   
2016-01-01     A      5
2016-01-03     A      4
2016-01-08     A     12
2016-01-17     A     11
2016-01-04     B     10
2016-01-05     B      5
2016-01-08     B     12
2016-01-17     B     11
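For anyone following along, the frame above could be built roughly like this. It is only a sketch: I am assuming dt is parsed as a datetime column, which pd.merge_asof needs further down.

```python
import pandas as pd

# Example data from the table above; dt is parsed to datetime64.
df = pd.DataFrame({
    "dt": pd.to_datetime([
        "2016-01-01", "2016-01-03", "2016-01-08", "2016-01-17",
        "2016-01-04", "2016-01-05", "2016-01-08", "2016-01-17",
    ]),
    "group": list("AAAABBBB"),
    "value": [5, 4, 12, 11, 10, 5, 12, 11],
})
print(df)
```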

Reshape df to be more useful, with group and a per-group row counter (gidx) as the index

d1 = df.drop(columns='group')
d1.index = [df.group, df.groupby('group').cumcount().rename('gidx')]
d1


Create a custom function that does what the old answer did, then apply it within each group via groupby

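def lag_merge_asof(df, lag):
    # Expanding mean of value, indexed by date (rows must be sorted by dt)
    d = df.set_index('dt').value.expanding().mean()
    # Shift the dates forward so each mean only becomes visible `lag` days later
    d.index = d.index + pd.offsets.Day(lag)
    d = d.reset_index(name='aggr_mean')
    # For each row, take the latest shifted mean dated on or before its dt
    return pd.merge_asof(df, d, on='dt')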

d1.groupby(level='group').apply(lag_merge_asof, lag=7)


The result can be flattened back into a plain frame with this

d1.groupby(level='group').apply(lag_merge_asof, lag=7) \
    .reset_index('group').reset_index(drop=True)
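A possible variation, not what the code above does: pd.merge_asof also accepts a by argument, so the per-group matching can be done in one call instead of groupby.apply. The sketch below assumes the example data from this answer; the names lookback and result are just illustrative, and I have not benchmarked it against the apply version.

```python
import pandas as pd

df = pd.DataFrame({
    "dt": pd.to_datetime([
        "2016-01-01", "2016-01-03", "2016-01-08", "2016-01-17",
        "2016-01-04", "2016-01-05", "2016-01-08", "2016-01-17",
    ]),
    "group": list("AAAABBBB"),
    "value": [5, 4, 12, 11, 10, 5, 12, 11],
})

# Sort so the expanding mean within each group runs in time order.
df = df.sort_values(["group", "dt"], kind="mergesort").reset_index(drop=True)

# Per-group expanding mean, dated 7 days later than the row it ends on.
lookback = pd.DataFrame({
    "dt": df["dt"] + pd.Timedelta(days=7),
    "group": df["group"],
    "aggr_mean": df.groupby("group")["value"].transform(
        lambda s: s.expanding().mean()
    ),
})

# merge_asof needs both sides sorted by the `on` key; a stable sort keeps
# same-day rows of a group in their original order.
result = pd.merge_asof(
    df.sort_values("dt", kind="mergesort"),
    lookback.sort_values("dt", kind="mergesort"),
    on="dt",
    by="group",
).sort_values(["group", "dt"], kind="mergesort").reset_index(drop=True)
print(result)
```

If the minimum sample size from the question matters, expanding(min_periods=5) could be used in place of expanding(); rows matched to a lookback entry with too few samples would then get NaN.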



old answer

For a single group, create a lookback dataframe by offsetting the dates by 7 days, then pass it to pd.merge_asof

lookback = df.set_index('dt').value.expanding().mean()

lookback.index += pd.offsets.Day(7)
lookback = lookback.reset_index(name='aggr_mean')

lookback


pd.merge_asof(df, lookback, left_on='dt', right_on='dt')
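Put together as one self-contained run, the single-group version might look like the sketch below. I am assuming df holds only group A from the question, already sorted by dt; the name result is mine.

```python
import pandas as pd

# Group A from the question as a single-group frame, sorted by dt.
df = pd.DataFrame({
    "dt": pd.to_datetime(
        ["2016-01-01", "2016-01-03", "2016-01-08", "2016-01-17"]
    ),
    "value": [5, 4, 12, 11],
})

# Expanding mean indexed by date, then pushed 7 days into the future.
lookback = df.set_index("dt")["value"].expanding().mean()
lookback.index = lookback.index + pd.offsets.Day(7)
lookback = lookback.reset_index(name="aggr_mean")

# Each row picks up the latest lookback entry dated on or before its dt.
result = pd.merge_asof(df, lookback, on="dt")
print(result)
```

The aggr_mean column comes out as NaN, NaN, 5.0, 7.0, which matches the expected values in the question's example.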


answered Jan 13 '23 12:01 by piRSquared