I have a dataframe where each row has a start time and an end time, and each row also belongs to a group. I want a new column which gives the total time covered by that row's group, in seconds.
For example if we have a group which looks like this:
id1: |----|
id2: |-----|
id3: |--------|
. . . . . . . . . . .
time -> 12:00 12:04 12:07 12:10
then all rows belonging to that group would get the summed time of 4 + 3 min = 420 seconds.
If they overlap completely then we would get a scenario like this:
id1: |--------|
id2: |--------|
. . . . . . . . . . .
time -> 12:00 12:04 12:07 12:10
which would give us the result 4 min = 240 seconds.
Below is some dummy data:
import pandas as pd
ids = [x for x in range(10)]
group = [0, 1, 1, 2, 2, 3, 4, 4, 4, 4]
start = pd.to_datetime(["2019-10-21-16:20:00", "2019-10-21-16:22:00", "2019-10-21-16:22:00", "2019-10-21-16:15:00",
"2019-10-21-16:22:00", "2019-10-21-16:58:00", "2019-10-21-17:02:00", "2019-10-21-17:03:00",
"2019-10-21-17:04:00", "2019-10-21-17:20:00"])
end = pd.to_datetime(["2019-10-21-16:25:00", "2019-10-21-16:24:00", "2019-10-21-16:24:00", "2019-10-21-16:18:00",
"2019-10-21-16:26:00", "2019-10-21-17:02:00", "2019-10-21-17:06:00", "2019-10-21-17:07:00",
"2019-10-21-17:08:00", "2019-10-21-17:22:00"])
cols = ["id", "group", "start", "end"]
df = pd.DataFrame(dict(zip(cols, [ids, group, start, end])))
So far the approach I've tried is obviously not correct: I grouped, took the min of each group's start times and the max of its end times, and used that interval as the sum. This is wrong because it also counts any gaps inside the interval.
gr = df.groupby("group").apply(lambda x : x.end.max() - x.start.min())
df['total_time'] = df.group.map(gr)
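For instance, group 2 in the dummy data covers 16:15 to 16:18 and 16:22 to 16:26, so `max(end) - min(start)` also counts the 4-minute gap between the two intervals:

```python
import pandas as pd

start = pd.to_datetime(["2019-10-21-16:15:00", "2019-10-21-16:22:00"])
end = pd.to_datetime(["2019-10-21-16:18:00", "2019-10-21-16:26:00"])
g2 = pd.DataFrame({"start": start, "end": end})

# min/max span includes the 16:18-16:22 gap:
span = (g2.end.max() - g2.start.min()).total_seconds()    # 11 min = 660 s
# actual covered time (no overlap here, so a plain sum works):
covered = (g2.end - g2.start).sum().total_seconds()       # 3 + 4 min = 420 s
print(span, covered)
```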
First, make sure the rows are sorted by start time within each group (the dummy data already is), then add a column which tracks the latest end time seen so far, considering only rows of the same group:
df['notbefore'] = df.groupby('group').end.transform(lambda s: s.shift().cummax())
The shift() makes the column reflect the latest end time seen on previous rows, excluding the current row. Both the shift() and the cummax() run inside the groupby (via transform), which is important: a cummax() applied to the whole column afterwards would "leak" values between groups.
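The per-group restart is the crucial detail: if the running max is taken over the whole column instead of inside each group, one group's end times bleed into the next. A small sketch (the names d, per_group and leaky are illustrative):

```python
import pandas as pd

d = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "end": pd.to_datetime(["2019-10-21 16:24", "2019-10-21 16:25",
                           "2019-10-21 16:18", "2019-10-21 16:20"]),
})

# Running max of previous end times, restarted at every group boundary:
per_group = d.groupby("group").end.transform(lambda s: s.shift().cummax())

# Global cummax after the grouped shift: group 1's 16:24 leaks into group 2.
leaky = d.groupby("group").end.shift().cummax()

print(per_group.iloc[3], leaky.iloc[3])
```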
Then add a column containing the "effective" start time:
df['effstart'] = df[['start', 'notbefore']].max(axis=1)
This is the start time modified so that it is not before any previous end time (to avoid overlap).
Then compute the total seconds covered:
df['effsec'] = (df.end - df.effstart).clip(lower=pd.Timedelta(0))
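The clip matters when an interval is entirely covered by earlier ones: then its end is before its effective start and the raw difference would be negative. A minimal sketch of that case:

```python
import pandas as pd

# An interval ending at 16:23 whose group already covers up to 16:24:
end = pd.Series(pd.to_datetime(["2019-10-21 16:23:00"]))
effstart = pd.Series(pd.to_datetime(["2019-10-21 16:24:00"]))

# The raw difference is -1 minute; clipping floors it at zero seconds.
effsec = (end - effstart).clip(lower=pd.Timedelta(0))
print(effsec.iloc[0])
```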
df
is now:
id group start end notbefore effstart effsec
0 0 0 2019-10-21 16:20:00 2019-10-21 16:25:00 NaT 2019-10-21 16:20:00 00:05:00
1 1 1 2019-10-21 16:22:00 2019-10-21 16:24:00 NaT 2019-10-21 16:22:00 00:02:00
2 2 1 2019-10-21 16:22:00 2019-10-21 16:24:00 2019-10-21 16:24:00 2019-10-21 16:24:00 00:00:00
3 3 2 2019-10-21 16:15:00 2019-10-21 16:18:00 NaT 2019-10-21 16:15:00 00:03:00
4 4 2 2019-10-21 16:22:00 2019-10-21 16:26:00 2019-10-21 16:18:00 2019-10-21 16:22:00 00:04:00
5 5 3 2019-10-21 16:58:00 2019-10-21 17:02:00 NaT 2019-10-21 16:58:00 00:04:00
6 6 4 2019-10-21 17:02:00 2019-10-21 17:06:00 NaT 2019-10-21 17:02:00 00:04:00
7 7 4 2019-10-21 17:03:00 2019-10-21 17:07:00 2019-10-21 17:06:00 2019-10-21 17:06:00 00:01:00
8 8 4 2019-10-21 17:04:00 2019-10-21 17:08:00 2019-10-21 17:07:00 2019-10-21 17:07:00 00:01:00
9 9 4 2019-10-21 17:20:00 2019-10-21 17:22:00 2019-10-21 17:08:00 2019-10-21 17:20:00 00:02:00
To get the final results:
df.groupby('group').effsec.sum()
Which gives you:
group
0 00:05:00
1 00:02:00
2 00:07:00
3 00:04:00
4 00:08:00
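Since the question asks for a per-row column in seconds, the group sums can be converted with dt.total_seconds() and mapped back onto the dataframe, reusing the map pattern from the attempted solution. Putting the whole pipeline together:

```python
import pandas as pd

ids = list(range(10))
group = [0, 1, 1, 2, 2, 3, 4, 4, 4, 4]
start = pd.to_datetime(["2019-10-21-16:20:00", "2019-10-21-16:22:00", "2019-10-21-16:22:00",
                        "2019-10-21-16:15:00", "2019-10-21-16:22:00", "2019-10-21-16:58:00",
                        "2019-10-21-17:02:00", "2019-10-21-17:03:00", "2019-10-21-17:04:00",
                        "2019-10-21-17:20:00"])
end = pd.to_datetime(["2019-10-21-16:25:00", "2019-10-21-16:24:00", "2019-10-21-16:24:00",
                      "2019-10-21-16:18:00", "2019-10-21-16:26:00", "2019-10-21-17:02:00",
                      "2019-10-21-17:06:00", "2019-10-21-17:07:00", "2019-10-21-17:08:00",
                      "2019-10-21-17:22:00"])
df = pd.DataFrame({"id": ids, "group": group, "start": start, "end": end})

# Latest previous end time within the same group:
df['notbefore'] = df.groupby('group').end.transform(lambda s: s.shift().cummax())
# Start time pushed forward past any previously covered time:
df['effstart'] = df[['start', 'notbefore']].max(axis=1)
# Effective duration, floored at zero for fully covered intervals:
df['effsec'] = (df.end - df.effstart).clip(lower=pd.Timedelta(0))

# Map each group's total covered time, in seconds, back onto every row:
totals = df.groupby('group').effsec.sum().dt.total_seconds()
df['total_time'] = df.group.map(totals)
print(df[['id', 'group', 'total_time']])
```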