I have a data frame where one or more events are recorded for each id. For each event, the id, a metric x and a date are recorded. Something like this:
import pandas as pd
import datetime as dt
import numpy as np
x = range(0, 6)
id = ['a', 'a', 'b', 'a', 'b', 'b']
dates = [dt.datetime(2012, 5, 2), dt.datetime(2012, 4, 2), dt.datetime(2012, 6, 2),
         dt.datetime(2012, 7, 30), dt.datetime(2012, 4, 1), dt.datetime(2012, 5, 9)]
df = pd.DataFrame(np.column_stack((id, x, dates)), columns=['id', 'x', 'dates'])
I'd like to be able to set a lookback period (e.g. 70 days) and calculate, for each row in the dataset, a cumulative sum of x over any preceding events for that id within the lookback window (excluding x for the row the calculation is being performed for). The result should end up looking like:
id x dates want
0 a 0 2012-05-02 00:00:00 1
1 a 1 2012-04-02 00:00:00 0
2 b 2 2012-06-02 00:00:00 9
3 a 3 2012-07-30 00:00:00 0
4 b 4 2012-04-01 00:00:00 0
5 b 5 2012-05-09 00:00:00 4
I needed to perform something similar, so I looked around a bit and found this page in the pandas cookbook (which I warmly recommend to anyone willing to learn about all the great possibilities of this package): Pandas: rolling mean by time interval. With more recent versions of pandas, you can pass a time-based window to rolling() and have it computed over a datetime-like column via the on argument, so the example becomes much more straightforward:
# First, convert the dates to datetime to make sure they are compatible
df['dates'] = pd.to_datetime(df['dates'])
# The column_stack construction above also leaves 'x' as object dtype
df['x'] = pd.to_numeric(df['x'])
# Then, sort the time series so that it is monotonic within each id
df.sort_values(['id', 'dates'], inplace=True)
# '70d' is the time window we are considering
# 'on' tells rolling() which column holds the timestamps
# 'closed' controls which interval bounds are included;
# 'neither' leaves the current row out of the sum
df['want'] = (df.groupby('id')
                .rolling('70d', on='dates', closed='neither')['x']
                .sum()
                .to_numpy())
df['want'] = np.where(df['want'].isnull(), 0, df['want']).astype(int)
df.sort_index()  # to display it in the same order as the example provided
id x dates want
0 a 0 2012-05-02 1
1 a 1 2012-04-02 0
2 b 2 2012-06-02 9
3 a 3 2012-07-30 0
4 b 4 2012-04-01 0
5 b 5 2012-05-09 4
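As a side note (not part of the original answer), if you expect to reuse this with different lookback periods, the same rolling() call can be wrapped in a small helper. The function name lookback_sum and the window_days parameter below are my own; this is only a sketch of the idea:
import numpy as np
import pandas as pd

def lookback_sum(df, window_days=70):
    """Sum of 'x' over earlier events of the same id within `window_days` days,
    excluding the current row (closed='neither')."""
    out = df.copy()
    out['dates'] = pd.to_datetime(out['dates'])
    out['x'] = pd.to_numeric(out['x'])   # the question's setup leaves 'x' as object dtype
    out = out.sort_values(['id', 'dates'])
    rolled = (out.groupby('id')
                 .rolling(f'{window_days}d', on='dates', closed='neither')['x']
                 .sum())
    out['want'] = rolled.fillna(0).astype(int).to_numpy()
    return out.sort_index()              # back to the original row order

# lookback_sum(df, 70) should reproduce the 'want' column shown above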
Well, one approach is the following: (1) do a groupby/apply with 'id' as the grouping variable. (2) Within the apply, resample the group to a daily time series. (3) Then use rolling_sum (plus a shift so you don't include the current row's 'x' value) to compute the sum over your 70-day lookback period. (4) Reduce the group back to only the original observations:
In [12]: df = df.sort(['id','dates'])
In [13]: df
Out[13]:
id x dates
1 a 1 2012-04-02
0 a 0 2012-05-02
3 a 3 2012-07-30
4 b 4 2012-04-01
5 b 5 2012-05-09
2 b 2 2012-06-02
You are going to need your data sorted by ['id', 'dates']. Now we can do the groupby/apply:
In [15]: def past70(g):
             g = g.set_index('dates').resample('D', 'last')
             g['want'] = pd.rolling_sum(g['x'], 70, 0).shift(1)
             return g[g.x.notnull()]
In [16]: df = df.groupby('id').apply(past70).drop('id',axis=1)
In [17]: df
Out[17]:
x want
id dates
a 2012-04-02 1 NaN
2012-05-02 0 1
2012-07-30 3 0
b 2012-04-01 4 NaN
2012-05-09 5 4
2012-06-02 2 9
If you don't want the NaNs then just do:
In [28]: df.fillna(0)
Out[28]:
x want
id dates
a 2012-04-02 1 0
2012-05-02 0 1
2012-07-30 3 0
b 2012-04-01 4 0
2012-05-09 5 4
2012-06-02 2 9
Edit: If you want to make the lookback window a parameter, do something like the following:
def past_window(g, win=70):
    g = g.set_index('dates').resample('D', 'last')
    g['want'] = pd.rolling_sum(g['x'], win, 0).shift(1)
    return g[g.x.notnull()]

df = df.groupby('id').apply(past_window, win=10)
print(df.fillna(0))
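Note that this answer uses an older pandas API: df.sort, resample('D', 'last') and pd.rolling_sum have since been removed. A rough modern-pandas equivalent of past_window, keeping the same resample-then-shift idea, might look like the sketch below; the surrounding preparation lines are mine, not the original answer's:
def past_window(g, win=70):
    # resample to a daily series so the integer window length is measured in days
    g = g.set_index('dates').resample('D').last()
    # rolling sum over `win` days, shifted so the current row's 'x' is excluded
    g['want'] = g['x'].rolling(win, min_periods=0).sum().shift(1)
    # keep only the original observations, dropping the filler days
    return g[g['x'].notnull()]

df['dates'] = pd.to_datetime(df['dates'])
df['x'] = pd.to_numeric(df['x'])   # the question's construction leaves 'x' as object dtype
df = df.sort_values(['id', 'dates'])
out = df.groupby('id').apply(past_window, win=70).drop(columns='id', errors='ignore')
print(out.fillna(0))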