I am struggling with pandas: how do I group values from multiple columns with conditions?
Here is what my data looks like as a pandas DataFrame:
id trigger timestamp
1 started 2017-10-01 14:00:1
1 ended 2017-10-04 12:00:1
2 started 2017-10-02 10:00:1
1 started 2017-10-03 11:00:1
2 ended 2017-10-04 12:00:1
2 started 2017-10-05 15:00:1
1 ended 2017-10-05 16:00:1
2 ended 2017-10-05 17:00:1
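To make this reproducible, the sample above could be built roughly like this. This is only a minimal sketch: the seconds are written as :01, which is an assumption about the values shown in the table.

```python
import pandas as pd

# Sample frame from the table above; seconds written as ':01' (assumption)
df = pd.DataFrame({
    'id': [1, 1, 2, 1, 2, 2, 1, 2],
    'trigger': ['started', 'ended', 'started', 'started',
                'ended', 'started', 'ended', 'ended'],
    'timestamp': [
        '2017-10-01 14:00:01', '2017-10-04 12:00:01',
        '2017-10-02 10:00:01', '2017-10-03 11:00:01',
        '2017-10-04 12:00:01', '2017-10-05 15:00:01',
        '2017-10-05 16:00:01', '2017-10-05 17:00:01',
    ],
})
print(df)
```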
My goal is to find the difference in days, hours, or minutes between the dates, grouped by id.
My output should look something like this (diff in hours):
id trigger timestamp trigger timestamp diff
1 started 2017-10-01 14:00:1 ended 2017-10-04 12:00:1 70
1 started 2017-10-03 11:00:1 ended 2017-10-05 16:00:1 53
2 started 2017-10-02 10:00:1 ended 2017-10-04 12:00:1 50
2 started 2017-10-05 15:00:1 ended 2017-10-05 17:00:1 2
I have tried many options, but I cannot find the most efficient solution.
Here is my code so far:
First I tried to split the data into 'started' and 'ended':
df['started'] = df.groupby(['id', 'timestamp'])['trigger'] == 'started'
df['ended'] = df.groupby(['id', 'timestamp'])['trigger'] == 'ended'
and then:
df.groupby(['id', 'started', 'ended'], as_index=True).sum()
but it didn't work. I also tried:
df['started'] = df.groupby(['trigger'])['timestamp'].np.where(df['trigger']=='started')
also without good results.
Can someone point me in the right direction on how to do this with pandas?
I will also have null matches in the data. How can I use df.fillna(method='ffill') to add NaN or missing data to the new dataframe?
Set id and trigger as the index, together with a per-group counter, so that df has a MultiIndex with 3 levels.
Then call unstack on the timestamp column.
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])  # if necessary
# Number the n-th 'started' and n-th 'ended' within each id so they pair up
i = df.groupby(['id', 'trigger']).cumcount()
df.set_index(['id', i, 'trigger']).timestamp.unstack().assign(
    diff=lambda d: d.ended.sub(d.started).dt.total_seconds() / 3600
)
Thanks to piRSquared for the improvement.
                   timestamp                       diff
trigger                ended             started
id
1  0     2017-10-04 12:00:01 2017-10-01 14:00:01   70.0
   1     2017-10-05 16:00:01 2017-10-03 11:00:01   53.0
2  0     2017-10-04 12:00:01 2017-10-02 10:00:01   50.0
   1     2017-10-05 17:00:01 2017-10-05 15:00:01    2.0
The result is not exactly as depicted in your question, but I believe a MultiIndex
of columns would be a cleaner way of representing your output instead of two trigger columns.
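If you do want something closer to the flat layout from the question, a minimal sketch could look like the following. The level name pair, the variable out, and the extra id 3 row (a 'started' with no 'ended') are assumptions added for illustration; that row is there to show what happens with unmatched events.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical unmatched
# 'started' row for id 3 (not in the original data)
df = pd.DataFrame({
    'id': [1, 1, 2, 1, 2, 2, 1, 2, 3],
    'trigger': ['started', 'ended', 'started', 'started', 'ended',
                'started', 'ended', 'ended', 'started'],
    'timestamp': pd.to_datetime([
        '2017-10-01 14:00:01', '2017-10-04 12:00:01', '2017-10-02 10:00:01',
        '2017-10-03 11:00:01', '2017-10-04 12:00:01', '2017-10-05 15:00:01',
        '2017-10-05 16:00:01', '2017-10-05 17:00:01', '2017-10-06 09:00:01',
    ]),
})

# Number the n-th 'started' and n-th 'ended' within each id so they pair up
pair = df.groupby(['id', 'trigger']).cumcount().rename('pair')

out = (
    df.set_index(['id', pair, 'trigger'])['timestamp']
      .unstack()                              # one column per trigger value
      .reindex(columns=['started', 'ended'])  # put 'started' before 'ended'
      .assign(diff=lambda d: (d['ended'] - d['started']).dt.total_seconds() / 3600)
      .reset_index()                          # back to ordinary columns
)
out.columns.name = None  # drop the leftover 'trigger' label on the columns
print(out)
```

The unmatched row keeps NaT in ended and NaN in diff, so the missing values show up in the result without any fillna call. Whether forward-filling them afterwards makes sense depends on your data.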