I have a dataframe where each row has a start time and an end time, and each row also belongs to a group. I want a new column which gives the total time covered by that row's group, in seconds.
For example if we have a group which looks like this:
id1: |----|
id2: |-----|
id3: |--------|
. . . . . . . . . . .
time -> 12:00 12:04 12:07 12:10
then all rows belonging to that group would get the summed time of 4 + 3 min = 420 seconds.
If they overlap completely then we would get a scenario like this:
id1: |--------|
id2: |--------|
. . . . . . . . . . .
time -> 12:00 12:04 12:07 12:10
which would give us the result 4 min = 240 seconds.
Below is some dummy data:
import pandas as pd
ids = [x for x in range(10)]
group = [0, 1, 1, 2, 2, 3, 4, 4, 4, 4]
start = pd.to_datetime(["2019-10-21-16:20:00", "2019-10-21-16:22:00", "2019-10-21-16:22:00", "2019-10-21-16:15:00",
"2019-10-21-16:22:00", "2019-10-21-16:58:00", "2019-10-21-17:02:00", "2019-10-21-17:03:00",
"2019-10-21-17:04:00", "2019-10-21-17:20:00"])
end = pd.to_datetime(["2019-10-21-16:25:00", "2019-10-21-16:24:00", "2019-10-21-16:24:00", "2019-10-21-16:18:00",
"2019-10-21-16:26:00", "2019-10-21-17:02:00", "2019-10-21-17:06:00", "2019-10-21-17:07:00",
"2019-10-21-17:08:00", "2019-10-21-17:22:00"])
cols = ["id", "group", "start", "end"]
df = pd.DataFrame(dict(zip(cols, [ids, group, start, end])))
So far the approach I've tried is obviously not correct: I grouped, took the min of each group's start times and the max of its end times, and used that interval as the sum. This is wrong because it also counts any gaps inside the interval.
gr = df.groupby("group").apply(lambda x : x.end.max() - x.start.min())
df['total_time'] = df.group.map(gr)
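For instance, group 2 in the dummy data covers 16:15 to 16:18 and 16:22 to 16:26, so `max(end) - min(start)` also counts the 4-minute gap between the two intervals:

```python
import pandas as pd

start = pd.to_datetime(["2019-10-21-16:15:00", "2019-10-21-16:22:00"])
end = pd.to_datetime(["2019-10-21-16:18:00", "2019-10-21-16:26:00"])
g2 = pd.DataFrame({"start": start, "end": end})

# min/max span includes the 16:18-16:22 gap:
span = (g2.end.max() - g2.start.min()).total_seconds()    # 11 min = 660 s
# actual covered time (no overlap here, so a plain sum works):
covered = (g2.end - g2.start).sum().total_seconds()       # 3 + 4 min = 420 s
print(span, covered)
```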
First, make sure the rows are sorted by start time within each group (the dummy data already is), then add a column which tracks the latest end time seen so far, considering only rows of the same group:
df['notbefore'] = df.groupby('group').end.transform(lambda s: s.shift().cummax())
The shift() makes the column reflect the latest end time seen on previous rows, excluding the current row. Both the shift() and the cummax() run inside the groupby (via transform), which is important: a cummax() applied to the whole column afterwards would "leak" values between groups.
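The per-group restart is the crucial detail: if the running max is taken over the whole column instead of inside each group, one group's end times bleed into the next. A small sketch (the names d, per_group and leaky are illustrative):

```python
import pandas as pd

d = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "end": pd.to_datetime(["2019-10-21 16:24", "2019-10-21 16:25",
                           "2019-10-21 16:18", "2019-10-21 16:20"]),
})

# Running max of previous end times, restarted at every group boundary:
per_group = d.groupby("group").end.transform(lambda s: s.shift().cummax())

# Global cummax after the grouped shift: group 1's 16:24 leaks into group 2.
leaky = d.groupby("group").end.shift().cummax()

print(per_group.iloc[3], leaky.iloc[3])
```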
Then add a column containing the "effective" start time:
df['effstart'] = df[['start', 'notbefore']].max(axis=1)
This is the start time modified so that it is not before any previous end time (to avoid overlap).
Then compute the total seconds covered:
df['effsec'] = (df.end - df.effstart).clip(lower=pd.Timedelta(0))
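The clip matters when an interval is entirely covered by earlier ones: then its end is before its effective start and the raw difference would be negative. A minimal sketch of that case:

```python
import pandas as pd

# An interval ending at 16:23 whose group already covers up to 16:24:
end = pd.Series(pd.to_datetime(["2019-10-21 16:23:00"]))
effstart = pd.Series(pd.to_datetime(["2019-10-21 16:24:00"]))

# The raw difference is -1 minute; clipping floors it at zero seconds.
effsec = (end - effstart).clip(lower=pd.Timedelta(0))
print(effsec.iloc[0])
```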
df
is now:
id group start end notbefore effstart effsec
0 0 0 2019-10-21 16:20:00 2019-10-21 16:25:00 NaT 2019-10-21 16:20:00 00:05:00
1 1 1 2019-10-21 16:22:00 2019-10-21 16:24:00 NaT 2019-10-21 16:22:00 00:02:00
2 2 1 2019-10-21 16:22:00 2019-10-21 16:24:00 2019-10-21 16:24:00 2019-10-21 16:24:00 00:00:00
3 3 2 2019-10-21 16:15:00 2019-10-21 16:18:00 NaT 2019-10-21 16:15:00 00:03:00
4 4 2 2019-10-21 16:22:00 2019-10-21 16:26:00 2019-10-21 16:18:00 2019-10-21 16:22:00 00:04:00
5 5 3 2019-10-21 16:58:00 2019-10-21 17:02:00 NaT 2019-10-21 16:58:00 00:04:00
6 6 4 2019-10-21 17:02:00 2019-10-21 17:06:00 NaT 2019-10-21 17:02:00 00:04:00
7 7 4 2019-10-21 17:03:00 2019-10-21 17:07:00 2019-10-21 17:06:00 2019-10-21 17:06:00 00:01:00
8 8 4 2019-10-21 17:04:00 2019-10-21 17:08:00 2019-10-21 17:07:00 2019-10-21 17:07:00 00:01:00
9 9 4 2019-10-21 17:20:00 2019-10-21 17:22:00 2019-10-21 17:08:00 2019-10-21 17:20:00 00:02:00
To get the final results:
df.groupby('group').effsec.sum()
Which gives you:
group
0 00:05:00
1 00:02:00
2 00:07:00
3 00:04:00
4 00:08:00
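Since the question asks for a per-row column in seconds, the group sums can be converted with dt.total_seconds() and mapped back onto the dataframe, reusing the map pattern from the attempted solution. Putting the whole pipeline together:

```python
import pandas as pd

ids = list(range(10))
group = [0, 1, 1, 2, 2, 3, 4, 4, 4, 4]
start = pd.to_datetime(["2019-10-21-16:20:00", "2019-10-21-16:22:00", "2019-10-21-16:22:00",
                        "2019-10-21-16:15:00", "2019-10-21-16:22:00", "2019-10-21-16:58:00",
                        "2019-10-21-17:02:00", "2019-10-21-17:03:00", "2019-10-21-17:04:00",
                        "2019-10-21-17:20:00"])
end = pd.to_datetime(["2019-10-21-16:25:00", "2019-10-21-16:24:00", "2019-10-21-16:24:00",
                      "2019-10-21-16:18:00", "2019-10-21-16:26:00", "2019-10-21-17:02:00",
                      "2019-10-21-17:06:00", "2019-10-21-17:07:00", "2019-10-21-17:08:00",
                      "2019-10-21-17:22:00"])
df = pd.DataFrame({"id": ids, "group": group, "start": start, "end": end})

# Latest previous end time within the same group:
df['notbefore'] = df.groupby('group').end.transform(lambda s: s.shift().cummax())
# Start time pushed forward past any previously covered time:
df['effstart'] = df[['start', 'notbefore']].max(axis=1)
# Effective duration, floored at zero for fully covered intervals:
df['effsec'] = (df.end - df.effstart).clip(lower=pd.Timedelta(0))

# Map each group's total covered time, in seconds, back onto every row:
totals = df.groupby('group').effsec.sum().dt.total_seconds()
df['total_time'] = df.group.map(totals)
print(df[['id', 'group', 'total_time']])
```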