
How to pass dataframe column value as window size after df.groupby?

    A   B   C
0   1   10  2
1   1   15  2
2   1   14  2
3   2   11  4
4   2   12  4
5   2   13  4
6   2   16  4
7   1   18  2

This is my sample DataFrame.
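
For anyone who wants to reproduce the problem, the table above can be rebuilt as follows. This is a minimal sketch; the values are copied from the table and the default integer index 0..7 is assumed.

```python
import pandas as pd

# Sample data copied from the table above; the default RangeIndex gives 0..7.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 1],
    "B": [10, 15, 14, 11, 12, 13, 16, 18],
    "C": [2, 2, 2, 4, 4, 4, 4, 2],
})
print(df)
```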

  1. I want to group by column 'A'.

  2. Within each group, I want a rolling sum on column 'B' whose window size comes from column 'C'. For example, when A is 1 the window size should be 2. Where fewer rows than the window size remain in a group, I want the sum of the remaining values instead of NaN.

Currently my output is:

A   
1  0    25.0
   1    29.0
   2    32.0
   7     NaN
2  3    23.0
   4    25.0
   5    29.0
   6     NaN

Code for the above: df['B'].groupby(df['A']).rolling(df['C'][0]).sum().shift(-1)

When C = 4, I want the rolling window to be 4, and I don't want NaN.

The desired output should be as follows:

    A   B   C   Rolling_sum
0   1   10  2   25
1   1   15  2   29
2   1   14  2   32
7   1   18  2   18
3   2   11  4   52
4   2   12  4   41
5   2   13  4   29
6   2   16  4   16
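
Read this way, the desired column is a forward-looking sum: each row adds its own B to the B values of the next C - 1 rows of the same group, and the window simply shrinks at the end of the group. A brute-force sketch of that definition is shown here. The helper name `forward_sum_reference` is made up for illustration, and it assumes C is constant within each group, as in the sample.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 1],
    "B": [10, 15, 14, 11, 12, 13, 16, 18],
    "C": [2, 2, 2, 4, 4, 4, 4, 2],
})

def forward_sum_reference(frame):
    """Plain-Python reference for the requested sum (hypothetical helper)."""
    out = pd.Series(0, index=frame.index, dtype="int64")
    for _, group in frame.groupby("A", sort=False):
        values = group["B"].tolist()
        window = int(group["C"].iloc[0])  # assumes one window size per group
        for pos, label in enumerate(group.index):
            # Current row plus the next window-1 rows; the slice is
            # automatically shorter near the end of the group.
            out.loc[label] = sum(values[pos:pos + window])
    return out

expected = forward_sum_reference(df)
print(expected.sort_index().tolist())
```

It is slow on large frames, but it is handy for checking a faster solution against.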
Asma Damani asked Dec 26 '19 09:12





1 Answer

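Because the window size changes per group (it comes from column C), use a lambda function inside groupby.apply. Reverse the row order with iloc[::-1] so that the normally backward-looking rolling window covers the following rows, and set min_periods=0 so that incomplete windows at the end of a group return a partial sum instead of NaN: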

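# Sort so each group's rows are contiguous; a stable sort keeps the original order within a group
df = df.sort_values('A', kind='stable')
# Roll over the reversed rows, taking the window size from the first C value of each group
df['Rolling_sum'] = (df.iloc[::-1].groupby('A')
                       .apply(lambda x: x.B.rolling(x.C.iat[0], min_periods=0).sum())
                       .reset_index(level=0, drop=True))
print(df)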
   A   B  C  Rolling_sum
0  1  10  2         25.0
1  1  15  2         29.0
2  1  14  2         32.0
7  1  18  2         18.0
3  2  11  4         52.0
4  2  12  4         41.0
5  2  13  4         29.0
6  2  16  4         16.0
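
The same idea can also be written as an explicit loop over the groups, which makes the reversal step visible and does not depend on the shape of the groupby.apply result. This is only a sketch: the function name `forward_rolling_sum` is invented here, min_periods=1 is used in place of min_periods=0, and C is again assumed constant within each group.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 1],
    "B": [10, 15, 14, 11, 12, 13, 16, 18],
    "C": [2, 2, 2, 4, 4, 4, 4, 2],
})

def forward_rolling_sum(frame):
    """Per-group rolling sum over reversed rows (hypothetical helper)."""
    parts = []
    for _, group in frame.groupby("A"):
        window = int(group["C"].iloc[0])    # one window size per group
        reversed_b = group["B"].iloc[::-1]  # reverse so the window looks forward
        # min_periods=1 yields partial sums instead of NaN at the group end
        parts.append(reversed_b.rolling(window, min_periods=1).sum())
    return pd.concat(parts)

# Assignment aligns on the index, so the reversed order does not matter here.
df["Rolling_sum"] = forward_rolling_sum(df)
print(df.sort_values("A", kind="stable"))
```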

If performance matters, a solution based on NumPy strides may be faster. The gain depends on the number and size of the groups, so it is best to test on real data:

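import numpy as np
import pandas as pd

def rolling_window(a, window):
    # Left-pad with zeros so every position has a full window
    a = np.concatenate([[0] * (int(window) - 1), np.asarray(a)])
    # Build a 2-D view with one row per window, without copying the data
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides).sum(axis=1)

df = df.sort_values('A', kind='stable')
# Same reversal trick as above, with the sum computed by rolling_window
df['Rolling_sum'] = (df.iloc[::-1].groupby('A')
                       .apply(lambda x: pd.Series(rolling_window(x.B, x.C.iat[0]),
                                                  index=x.index))
                       .reset_index(level=0, drop=True))
print(df)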
   A   B  C  Rolling_sum
0  1  10  2           25
1  1  15  2           29
2  1  14  2           32
7  1  18  2           18
3  2  11  4           52
4  2  12  4           41
5  2  13  4           29
6  2  16  4           16
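
A variation on the strided approach, offered as a sketch rather than a drop-in replacement: numpy.lib.stride_tricks.sliding_window_view (assumed available, NumPy 1.20 or later) builds the windows with bounds checking, and padding zeros on the right removes the need to reverse the rows. The helper name `forward_window_sum` is made up for this example.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def forward_window_sum(values, window):
    """Sum each value with the next window-1 values (hypothetical helper)."""
    values = np.asarray(values)
    # Right-pad with zeros so the last rows still get a full-width window.
    padded = np.concatenate([values, np.zeros(window - 1, dtype=values.dtype)])
    return sliding_window_view(padded, window).sum(axis=1)

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 1],
    "B": [10, 15, 14, 11, 12, 13, 16, 18],
    "C": [2, 2, 2, 4, 4, 4, 4, 2],
})

result = pd.Series(0, index=df.index, dtype="int64")
for _, group in df.groupby("A"):
    window = int(group["C"].iloc[0])  # assumes C is constant within a group
    result.loc[group.index] = forward_window_sum(group["B"].to_numpy(), window)

df["Rolling_sum"] = result
print(df.sort_values("A", kind="stable"))
```

Whether this is faster than the rolling version depends on the data, so it is worth timing both on a realistic frame.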
jezrael answered Sep 29 '22 15:09
