
Total Duration (no double counting) - Python - Pandas

I have a pandas DataFrame as shown below. The columns are date, color, start time, and duration (in seconds). I need to calculate how much time during a day each color is displayed.

date          color          start time            duration(seconds)
2021-07-06    RED            11:00:00.00           5
2021-07-06    RED            11:00:00.00           9
2021-07-06    BLUE           11:00:00.00           3
2021-07-06    RED            11:00:00.00           3
2021-07-06    BLUE           12:00:00.00           10
2021-07-06    BLUE           12:00:00.00           7
2021-07-06    RED            12:00:00.00           9
2021-07-06    BLUE           12:00:00.00           5
2021-07-06    RED            12:00:00.00           1
2021-07-06    RED            12:00:00.00           2

For example, in a 24-hour day I need to know how long each color is displayed. There will be a variable number of colors each day, with staggered start times and varying durations.

Taking red in the example above, the display duration would be 18 seconds: overlapping displays are not double counted.

My desired output is a DataFrame that tells me how long each color was displayed and how long any color was displayed. The maximum for a single color, or for all colors combined, is 24 hours. For the example above, the answer would be:

Red Duration: 18 seconds
Blue Duration: 13 seconds
Total Duration: 19 seconds

How would I go about doing this?

asked May 18 '26 18:05 by left-them-on-red


1 Answer

This solution is similar to the staircase one, but it uses interval arrays instead of step functions. It relies on piso, a package built for set operations on pandas interval classes.

setup

Assume the same setup as the staircase solution: a DataFrame df with a color column and Timestamp columns start and end.
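The staircase answer's setup is not reproduced on this page. As a rough sketch, assuming the question's table is loaded with the hypothetical column names date, color, start_time and duration, the start and end columns used in the solution could be derived like this. Note that the printed outputs in this answer come from the staircase answer's data, so they will not match this table.

```python
import pandas as pd

# Hypothetical reconstruction of the question's table; the column names
# date, start_time and duration are assumptions, not taken from the answer.
df = pd.DataFrame({
    "date": ["2021-07-06"] * 10,
    "color": ["RED", "RED", "BLUE", "RED", "BLUE",
              "BLUE", "RED", "BLUE", "RED", "RED"],
    "start_time": ["11:00:00.00"] * 4 + ["12:00:00.00"] * 6,
    "duration": [5, 9, 3, 3, 10, 7, 9, 5, 1, 2],
})

# Combine date and time into a Timestamp, then add the duration in seconds.
df["start"] = pd.to_datetime(df["date"] + " " + df["start_time"])
df["end"] = df["start"] + pd.to_timedelta(df["duration"], unit="s")
```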

solution

For each colour, create a pandas.arrays.IntervalArray (or a pandas.IntervalIndex):

import pandas as pd
import piso

# one IntervalArray per colour, built from the start and end Timestamps
interval_arrays = df.groupby("color").apply(lambda d: pd.arrays.IntervalArray.from_arrays(d["start"], d["end"]))

interval_arrays looks like this:

color
BLUE     [(2021-07-06 11:00:01, 2021-07-06 11:00:08], (...
RED      [(2021-07-06 11:00:00, 2021-07-06 11:00:05], (...
dtype: object

Next, take the union of the intervals within each array, which merges any overlaps:

interval_arrays = interval_arrays.apply(piso.union)

Then take the union across the two arrays to get the total:

interval_arrays["TOTAL"] = piso.union(*interval_arrays)

interval_arrays now looks like this:

color
BLUE     [(2021-07-06 11:00:01, 2021-07-06 11:00:08], (...
RED      [(2021-07-06 11:00:00, 2021-07-06 11:00:05], (...
TOTAL         [(2021-07-06 11:00:00, 2021-07-06 11:00:17]]
dtype: object
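To make the union step concrete, here is a minimal pure-Python sketch of the same idea: merge overlapping intervals, then sum their lengths. It is not piso's implementation, and merge_intervals and total_seconds are hypothetical helper names. The rows are copied from the question's table rather than from the staircase setup.

```python
from datetime import datetime, timedelta

def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into sorted, disjoint pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

def total_seconds(intervals):
    """Total covered time in seconds, counting overlaps once."""
    return sum((e - s).total_seconds() for s, e in merge_intervals(intervals))

# Rows from the question's table: (color, start time, duration in seconds).
rows = [
    ("RED", "11:00:00", 5), ("RED", "11:00:00", 9), ("BLUE", "11:00:00", 3),
    ("RED", "11:00:00", 3), ("BLUE", "12:00:00", 10), ("BLUE", "12:00:00", 7),
    ("RED", "12:00:00", 9), ("BLUE", "12:00:00", 5), ("RED", "12:00:00", 1),
    ("RED", "12:00:00", 2),
]

by_color = {}
for color, start_text, seconds in rows:
    start = datetime.strptime("2021-07-06 " + start_text, "%Y-%m-%d %H:%M:%S")
    by_color.setdefault(color, []).append((start, start + timedelta(seconds=seconds)))

# Per-colour union, then the union over every interval for the total.
durations = {color: total_seconds(ivs) for color, ivs in by_color.items()}
durations["TOTAL"] = total_seconds([iv for ivs in by_color.values() for iv in ivs])
print(durations)  # {'RED': 18.0, 'BLUE': 13.0, 'TOTAL': 19.0}
```

The result matches the 18, 13 and 19 seconds the question expects.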

Create the bins as a pandas.IntervalIndex (four break points give three 5-second bins):

bins = pd.date_range(pd.Timestamp("2021-07-06 11:00:00"), freq="5s", periods=4)
ii_bins = pd.IntervalIndex.from_breaks(bins)

Then use piso.coverage(), which takes an interval array and a domain and returns the fraction of the domain (here, a bin) covered by the intervals in the array. Multiplying that fraction by the bin length gives the time covered within the bin:

interval_arrays.apply(lambda ia: pd.Series([piso.coverage(ia, bin)*bin.length for bin in ii_bins]))

This results in a dataframe:

color               0               1               2
BLUE  0 days 00:00:04 0 days 00:00:05 0 days 00:00:04
RED   0 days 00:00:05 0 days 00:00:04 0 days 00:00:05
TOTAL 0 days 00:00:05 0 days 00:00:05 0 days 00:00:05

The columns are bin indices. You can replace them with the intervals in ii_bins if you prefer, and melt the dataframe if you want tidy data.

This approach handles both overlapping intervals and intervals that cross bin boundaries.
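The coverage calculation can be pictured with a small pure-Python sketch: clip each already-merged interval to the bin, add up what remains, and divide by the bin length. This only illustrates the idea and is not how piso.coverage is implemented. The coverage function and the single red interval below are hypothetical.

```python
from datetime import datetime, timedelta

def coverage(disjoint_intervals, bin_start, bin_end):
    """Fraction of [bin_start, bin_end] covered by already-merged intervals."""
    covered = timedelta(0)
    for start, end in disjoint_intervals:
        # Clip the interval to the bin; skip it if nothing is left.
        lo, hi = max(start, bin_start), min(end, bin_end)
        if hi > lo:
            covered += hi - lo
    return covered / (bin_end - bin_start)

t0 = datetime(2021, 7, 6, 11, 0, 0)

# Hypothetical, already-disjoint display interval (e.g. the result of a union).
red = [(t0, t0 + timedelta(seconds=9))]

# Three consecutive 5-second bins starting at 11:00:00.
width = timedelta(seconds=5)
bins = [(t0 + i * width, t0 + (i + 1) * width) for i in range(3)]

# Fraction covered times bin length gives the time covered in each bin.
seconds_per_bin = [
    (coverage(red, b_start, b_end) * (b_end - b_start)).total_seconds()
    for b_start, b_end in bins
]
print(seconds_per_bin)  # [5.0, 4.0, 0.0]
```

The 9-second interval crosses the first bin boundary, so it contributes 5 seconds to the first bin and 4 to the second.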

answered May 20 '26 09:05 by Riley


