
Find days since last event pandas dataframe

Tags: python, pandas

I have a pandas data frame:

import pandas as pd

df12 = pd.DataFrame({
    'group_ids': [1, 1, 1, 2, 2, 2],
    'dates': ['2016-04-01', '2016-04-20', '2016-04-28',
              '2016-04-05', '2016-04-20', '2016-04-29'],
    'event_today_in_group': [1, 0, 1, 1, 1, 0],
})


   group_ids      dates  event_today_in_group
0          1 2016-04-01                     1
1          1 2016-04-20                     0
2          1 2016-04-28                     1
3          2 2016-04-05                     1
4          2 2016-04-20                     1
5          2 2016-04-29                     0

I would like to compute an additional column that contains, within each group_ids, the number of days since the previous row where event_today_in_group was 1 (0 when the group has no previous row). Expected output:

   group_ids      dates  event_today_in_group  days_since_last_event
0          1 2016-04-01                     1                      0
1          1 2016-04-20                     0                     19
2          1 2016-04-28                     1                     27
3          2 2016-04-05                     1                      0
4          2 2016-04-20                     1                     15
5          2 2016-04-29                     0                      9
Srikant Chari asked Jul 10 '17 21:07


1 Answer

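Assuming df is the data frame from the question (df12 there) and that dates is converted to datetimes first, this will get you the non-cumulative difference in days between consecutive dates within each group:

df['dates'] = pd.to_datetime(df['dates'])
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().dt.days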
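
As a quick check of this first step, here is a minimal sketch on the question's data. The name step_days is mine, and the explicit pd.to_datetime call is there because the question builds dates as strings, which diff() cannot subtract.

```python
import pandas as pd

# Data from the question; `df` is the name used in this answer.
df = pd.DataFrame({
    'group_ids': [1, 1, 1, 2, 2, 2],
    'dates': ['2016-04-01', '2016-04-20', '2016-04-28',
              '2016-04-05', '2016-04-20', '2016-04-29'],
    'event_today_in_group': [1, 0, 1, 1, 1, 0],
})
df['dates'] = pd.to_datetime(df['dates'])  # strings -> datetimes

# Day gap between each row and the previous row of the same group.
# The first row of every group has no previous row, so it is NaN.
step_days = df.groupby('group_ids')['dates'].diff().dt.days
print(step_days.tolist())
```

Row 2 comes out as 8 here (days since 2016-04-20), not the 27 the question asks for, because the gap is measured to the previous row rather than to the previous event.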

To turn this into days since the most recent event, the differences have to be summed over each run of rows that follows an event. I propose using shift to get the event flag of the previous row and then taking a cumulative sum, which yields a key that increments on the row after each event:

df['event_today_in_group'].shift().cumsum()

Output:

0    NaN
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
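
To illustrate why the shift matters, the sketch below compares the shifted key with a hypothetical unshifted one. The variable names are mine and only the event column from the question is used.

```python
import pandas as pd

events = pd.Series([1, 0, 1, 1, 1, 0], name='event_today_in_group')

# Key from the answer: the counter increments on the row AFTER an event,
# so an event row still belongs to the block opened by the previous event.
key_shifted = events.shift().cumsum()

# Hypothetical alternative for comparison: without shift, the event row
# itself opens a new block and its own gap would not be accumulated.
key_unshifted = events.cumsum()

print(pd.DataFrame({'event': events,
                    'shifted': key_shifted,
                    'unshifted': key_unshifted}))
```

With the shifted key, rows 1 and 2 share a block, so the gaps 19 and 8 add up to 27 on row 2. With the unshifted key, row 2 would sit in a block of its own.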

This gives us the second grouping value we need to get the cumulative sums. You could assign the above values to a new column, but if you're only using them for the calculation, then you can simply include them in the subsequent groupby operation like so:

df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()

Result:

   group_ids      dates  event_today_in_group  days_since_last_event
0          1 2016-04-01                     1                    NaN
1          1 2016-04-20                     0                   19.0
2          1 2016-04-28                     1                   27.0
3          2 2016-04-05                     1                    NaN
4          2 2016-04-20                     1                   15.0
5          2 2016-04-29                     0                    9.0
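
The question's expected output has 0 rather than NaN on the first row of each group. A minimal end-to-end sketch that adds that last step is below. The helper name add_days_since_last_event is hypothetical, and shift(fill_value=0) is a small deviation from the code above that I use only to keep NaN out of the grouping key.

```python
import pandas as pd

def add_days_since_last_event(df):
    """Hypothetical helper: the steps above plus a final fillna(0)."""
    out = df.copy()
    out['dates'] = pd.to_datetime(out['dates'])
    # Step 1: day gap to the previous row within each group.
    gap = out.groupby('group_ids')['dates'].diff().dt.days
    # Step 2: block key that increments on the row after each event.
    block = out['event_today_in_group'].shift(fill_value=0).cumsum()
    # Step 3: accumulate the gaps within each (group, block) pair.
    total = gap.groupby([out['group_ids'], block]).cumsum()
    # Rows with no earlier row in their group are NaN; report them as 0.
    out['days_since_last_event'] = total.fillna(0).astype(int)
    return out

df12 = pd.DataFrame({
    'group_ids': [1, 1, 1, 2, 2, 2],
    'dates': ['2016-04-01', '2016-04-20', '2016-04-28',
              '2016-04-05', '2016-04-20', '2016-04-29'],
    'event_today_in_group': [1, 0, 1, 1, 1, 0],
})
result = add_days_since_last_event(df12)
print(result)
```

One caveat: if a group's first row has no event, the following rows are counted from that first row, since no earlier event exists in the group. That case does not occur in the question's data.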
cmaher answered Oct 19 '22 17:10
