Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Sum Previous N Rows by Group

Tags:

python

pandas

I want to sum the prior N periods of data for each group. I have seen how to do each individually (sum by group, or sum prior N periods), but can't figure out a clean way to do both together.

I'm currently doing the following:

import pandas as pd

sample_data = {'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],\
                'clicks': [0,1,2,3,4,5,6,7,8,9]}
df = pd.DataFrame(sample_data)
df['clicks.1'] = df.groupby(['user'])['clicks'].shift(1)
df['clicks.2'] = df.groupby(['user'])['clicks'].shift(2)
df['clicks.3'] = df.groupby(['user'])['clicks'].shift(3)
df['total_clicks_prior3'] = df[['clicks.1','clicks.2', 'clicks.3']].sum(axis=1)

I don't want the 3 intermediate lagged columns, I just want the sum of those, so my desired output is:

>>> df[['clicks','user','total_clicks_prior3']]
   clicks user  total_clicks_prior3
0       0    a                  NaN
1       1    a                  0.0
2       2    a                  1.0
3       3    a                  3.0
4       4    a                  6.0
5       5    b                  NaN
6       6    b                  5.0
7       7    b                 11.0
8       8    b                 18.0
9       9    b                 21.0

Note: I could obviously drop the 3 columns after creating them, but given that I will be creating multiple columns of different numbers of lagged periods, I feel like there has to be an easier way.

like image 762
Tchotchke Avatar asked Sep 04 '19 17:09

Tchotchke


2 Answers

This is groupby + rolling + shift

df.groupby('user')['clicks'].rolling(3, min_periods=1).sum().groupby(level=0).shift()

user
a     0     NaN
      1     0.0
      2     1.0
      3     3.0
      4     6.0
b     5     NaN
      6     5.0
      7    11.0
      8    18.0
      9    21.0
Name: clicks, dtype: float64
like image 127
user3483203 Avatar answered Sep 28 '22 19:09

user3483203


If you have a solution that works for each group, you can use apply to use it on the groupby object. For instance, you linked to a question that has df['A'].rolling(min_periods=1, window=11).sum() as an answer. If that does what you want on the subgroups, you can do

df.groupby('user').apply(lambda x: x['clicks'].rolling(min_periods=1, window=11).sum())
like image 38
Acccumulation Avatar answered Sep 28 '22 19:09

Acccumulation