Pandas - groupby columns with conditions from another column

I am struggling with pandas regarding how to group multiple column values with conditions:

Here is what my data looks like as a pandas dataframe:

id      trigger     timestamp
1       started     2017-10-01 14:00:1
1       ended       2017-10-04 12:00:1
2       started     2017-10-02 10:00:1
1       started     2017-10-03 11:00:1
2       ended       2017-10-04 12:00:1    
2       started     2017-10-05 15:00:1
1       ended       2017-10-05 16:00:1
2       ended       2017-10-05 17:00:1

My goal is to find the difference in days, hours, or minutes between each started/ended pair of timestamps, grouped by id.

My output should look more like this (diff in hrs):

id      trigger     timestamp           trigger     timestamp               diff
1       started     2017-10-01 14:00:1  ended       2017-10-04 12:00:1      70
1       started     2017-10-03 11:00:1  ended       2017-10-05 16:00:1      53
2       started     2017-10-02 10:00:1  ended       2017-10-04 12:00:1      50
2       started     2017-10-05 15:00:1  ended       2017-10-05 17:00:1      2
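
For reference, the diff column is just timestamp subtraction converted to a unit. A minimal sketch for the first pair of id 1, assuming the truncated seconds field `:1` means `:01`; the variable names are only illustrative:

```python
import pandas as pd

# First started/ended pair for id 1 (seconds assumed to be :01).
started = pd.Timestamp("2017-10-01 14:00:01")
ended = pd.Timestamp("2017-10-04 12:00:01")

delta = ended - started  # a pandas Timedelta

# Dividing by a unit-sized Timedelta gives the difference as a float.
diff_days = delta / pd.Timedelta(days=1)
diff_hours = delta / pd.Timedelta(hours=1)
diff_minutes = delta / pd.Timedelta(minutes=1)

print(diff_days, diff_hours, diff_minutes)
```

The same division works on a whole column of differences, so the unit can be switched without changing the rest of the logic.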

I have tried many options, but I cannot find the most efficient solution.

Here is my code until now:

First I tried to split the data in 'started' and 'ended':

df['started'] = df.groupby(['id', 'timestamp'])['trigger'] == 'started'

df['ended'] = df.groupby(['id', 'timestamp'])['trigger'] == 'ended'

and then:

df.groupby(['id', 'started', 'ended'], as_index=True).sum()

but it didn't work. I also tried:

df['started'] = df.groupby(['trigger'])['timestamp'].np.where(df['trigger']=='started')

also without good results.

Can someone point me in the right direction on how to do this with pandas? I will also have null matches in the data; how can I use df.fillna(method='ffill') to add NaN or missing data to the new dataframe?

El_Patrón asked Feb 19 '18 23:02


1 Answer

  1. Set id and trigger as the index
  2. Since the index contains duplicate entries, append another index level with the groupwise cumcount. In total, df must have a MultiIndex with 3 levels
  3. Select timestamp and unstack the trigger level, so that started and ended become columns
  4. Find the difference between the two columns in hours and assign the result back

import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])  # if necessary

# Number each started/ended occurrence within its id so the pairs line up
i = df.groupby(['id', 'trigger']).cumcount()
v = df.set_index(['id', i, 'trigger']).timestamp.unstack().assign(
       diff=lambda d: d.ended.sub(d.started).dt.total_seconds() / 3600
)

Thanks to piRSquared for the improvement.

v

                  timestamp                      diff
trigger               ended             started      
id                                                   
1  0    2017-10-04 12:00:01 2017-10-01 14:00:01  70.0
   1    2017-10-05 16:00:01 2017-10-03 11:00:01  53.0
2  0    2017-10-04 12:00:01 2017-10-02 10:00:01  50.0
   1    2017-10-05 17:00:01 2017-10-05 15:00:01   2.0

The result is not exactly as depicted in your question, but I believe a MultiIndex of columns would be a cleaner way of representing your output instead of two trigger columns.
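
If the flat layout from the question is still preferred, the same pairing can be reshaped into ordinary columns. The following is a minimal sketch, not the only way to do it. It assumes the rows are in chronological order within each id, so that the n-th started lines up with the n-th ended. The level name `pair`, the result name `flat`, and the id 3 row are hypothetical; id 3 is an unmatched started added only to show how a missing match comes out.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 1, 2, 2, 1, 2, 3],
    "trigger": ["started", "ended", "started", "started", "ended",
                "started", "ended", "ended", "started"],
    "timestamp": pd.to_datetime([
        "2017-10-01 14:00:01", "2017-10-04 12:00:01", "2017-10-02 10:00:01",
        "2017-10-03 11:00:01", "2017-10-04 12:00:01", "2017-10-05 15:00:01",
        "2017-10-05 16:00:01", "2017-10-05 17:00:01",
        "2017-10-06 09:00:01",  # hypothetical start with no matching end
    ]),
})

# Number each started/ended occurrence within its id so the pairs line up.
i = df.groupby(["id", "trigger"]).cumcount().rename("pair")

flat = (
    df.set_index(["id", i, "trigger"])["timestamp"]
      .unstack("trigger")[["started", "ended"]]  # one column per trigger value
      .assign(diff=lambda d: (d["ended"] - d["started"]).dt.total_seconds() / 3600)
      .reset_index()
      .rename_axis(None, axis="columns")  # drop the leftover 'trigger' label
)
print(flat)
```

The unmatched start for id 3 ends up with NaT in `ended` and NaN in `diff`, so missing matches are kept rather than silently dropped. If only complete pairs are wanted, `flat.dropna(subset=["ended"])` removes them.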

cs95 answered Oct 15 '22 09:10