
Pandas sum time interval in a group excluding overlaps

I have a dataframe where each row has a start time and an end time, and a group it belongs to. I want a new column giving the total time covered by that group, in seconds, counting overlapping stretches only once.

For example, if we have a group which looks like this:

         id1:    |----|
         id2:       |-----|
         id3:                      |--------|
                 .  .  .  .  .  .  .  .  .  .  .
time ->        12:00    12:04    12:07    12:10

then all rows belonging to that group would get the summed time of 4 + 3 min = 420 seconds.

If they overlap completely, we would get a scenario like this:

         id1:    |--------|
         id2:    |--------|
                 .  .  .  .  .  .  .  .  .  .  .
time ->        12:00    12:04    12:07    12:10

which would give us the result 4 min = 240 seconds.

Below is some dummy data:


import pandas as pd

ids = list(range(10))
group = [0, 1, 1, 2, 2, 3, 4, 4, 4, 4]

start = pd.to_datetime(["2019-10-21-16:20:00", "2019-10-21-16:22:00", "2019-10-21-16:22:00", "2019-10-21-16:15:00",
         "2019-10-21-16:22:00", "2019-10-21-16:58:00", "2019-10-21-17:02:00", "2019-10-21-17:03:00",
         "2019-10-21-17:04:00", "2019-10-21-17:20:00"])

end = pd.to_datetime(["2019-10-21-16:25:00", "2019-10-21-16:24:00", "2019-10-21-16:24:00", "2019-10-21-16:18:00",
       "2019-10-21-16:26:00", "2019-10-21-17:02:00", "2019-10-21-17:06:00", "2019-10-21-17:07:00",
       "2019-10-21-17:08:00", "2019-10-21-17:22:00"])

cols = ["id", "group", "start", "end"]


df = pd.DataFrame(dict(zip(cols, [ids, group, start, end])))

The approach I've tried so far is obviously not correct: I grouped, took the min start and max end of each group, and used that span as the sum. This overcounts, because the span also includes any gaps between the intervals.

gr = df.groupby("group").apply(lambda x : x.end.max() - x.start.min())
df['total_time'] = df.group.map(gr)
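
For group 2, for example, that min-to-max span is 11 minutes, while its two intervals cover only 3 + 4 = 7 minutes, so the gap from 16:18 to 16:22 is wrongly counted:

g2 = df[df.group == 2]
print(g2.end.max() - g2.start.min())  # 0 days 00:11:00 -- span, including the gap
print((g2.end - g2.start).sum())      # 0 days 00:07:00 -- true coverage (group 2 has no overlaps)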
asked Dec 17 '22 by Chicony

1 Answer

First, add a column which tracks, within each group, the latest end time seen on any previous row:

df['notbefore'] = df.groupby('group')['end'].transform(lambda s: s.shift().cummax())

The shift() by 1 makes each row see the latest end time among the rows before it, not including itself. Doing both shift() and cummax() inside transform() keeps the whole computation within each group; a bare df.groupby('group').end.shift().cummax() would shift per group but then apply cummax() across the whole frame, leaking end times from one group into the next. This also assumes the rows of each group are sorted by start time, as they are in the dummy data.
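
As a quick sanity check on the dummy data: the leaky variant picks up group 1's end time (16:24) on the first overlap check of group 2, while the per-group version correctly sees group 2's own previous end (16:18):

leaky = df.groupby('group').end.shift().cummax()  # shift per group, but cummax across groups
print(leaky[4])                 # 2019-10-21 16:24:00 -- leaked from group 1
print(df.loc[4, 'notbefore'])   # 2019-10-21 16:18:00 -- group 2's own previous end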

Then add a column containing the "effective" start time:

df['effstart'] = df[['start', 'notbefore']].max(axis=1)

This is the start time pushed forward so that it isn't before any earlier end time in its group, which is what prevents overlapping stretches from being counted twice.
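
For instance, row 7 starts at 17:03, but the previous row in group 4 ended at 17:06, so its effective start is pushed to 17:06:

print(df.loc[7, ['start', 'notbefore', 'effstart']])
# start       2019-10-21 17:03:00
# notbefore   2019-10-21 17:06:00
# effstart    2019-10-21 17:06:00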

Then compute the time each row effectively covers:

df['effsec'] = (df.end - df.effstart).clip(lower=pd.Timedelta(0))
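
The clip handles intervals that are completely contained in an earlier one: there, end - effstart comes out negative, and clipping at zero excludes them from the sum. A minimal sketch of the behaviour, with made-up durations:

durations = pd.Series([pd.Timedelta(minutes=-1), pd.Timedelta(minutes=2)])
print(durations.clip(lower=pd.Timedelta(0)))  # the -1 min becomes 0 days 00:00:00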

df is now:

   id  group               start                 end           notbefore            effstart   effsec
0   0      0 2019-10-21 16:20:00 2019-10-21 16:25:00                 NaT 2019-10-21 16:20:00 00:05:00
1   1      1 2019-10-21 16:22:00 2019-10-21 16:24:00                 NaT 2019-10-21 16:22:00 00:02:00
2   2      1 2019-10-21 16:22:00 2019-10-21 16:24:00 2019-10-21 16:24:00 2019-10-21 16:24:00 00:00:00
3   3      2 2019-10-21 16:15:00 2019-10-21 16:18:00                 NaT 2019-10-21 16:15:00 00:03:00
4   4      2 2019-10-21 16:22:00 2019-10-21 16:26:00 2019-10-21 16:18:00 2019-10-21 16:22:00 00:04:00
5   5      3 2019-10-21 16:58:00 2019-10-21 17:02:00                 NaT 2019-10-21 16:58:00 00:04:00
6   6      4 2019-10-21 17:02:00 2019-10-21 17:06:00                 NaT 2019-10-21 17:02:00 00:04:00
7   7      4 2019-10-21 17:03:00 2019-10-21 17:07:00 2019-10-21 17:06:00 2019-10-21 17:06:00 00:01:00
8   8      4 2019-10-21 17:04:00 2019-10-21 17:08:00 2019-10-21 17:07:00 2019-10-21 17:07:00 00:01:00
9   9      4 2019-10-21 17:20:00 2019-10-21 17:22:00 2019-10-21 17:08:00 2019-10-21 17:20:00 00:02:00

To get the final results:

df.groupby('group').effsec.sum()

Which gives you:

group
0   00:05:00
1   00:02:00
2   00:07:00
3   00:04:00
4   00:08:00
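
Since the question asks for the total in seconds as a new column, one possible final step is to broadcast the group sums back with transform and convert:

df['total_time'] = df.groupby('group').effsec.transform('sum').dt.total_seconds()
# group 2's rows now carry 420.0 -- matching the 4 + 3 min = 420 s from the question's first example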
answered Dec 31 '22 by John Zwinck