I have a pandas DataFrame as shown below. The columns are date, color, time, and duration (in seconds). I need to calculate the amount of time throughout a day that we are displaying a color.
date        color  start time   duration(seconds)
2021-07-06  RED    11:00:00.00  5
2021-07-06  RED    11:00:00.00  9
2021-07-06  BLUE   11:00:00.00  3
2021-07-06  RED    11:00:00.00  3
2021-07-06  BLUE   12:00:00.00  10
2021-07-06  BLUE   12:00:00.00  7
2021-07-06  RED    12:00:00.00  9
2021-07-06  BLUE   12:00:00.00  5
2021-07-06  RED    12:00:00.00  1
2021-07-06  RED    12:00:00.00  2
If we're looking at the color red in the example above, the duration of displaying red would be 18 seconds. We don't double count any display overlaps.
My desired output would be a DataFrame which tells me how long each color was displayed for, and how long any color was displayed overall. Neither a single color's duration nor the overall duration can exceed 24 hours. For the example above, the answer would be:
Red Duration: 18 seconds
Blue Duration: 13 seconds
Total Duration: 19 seconds
How would I go about doing this?
There is a solution, similar to the one using staircase, that works with interval arrays instead of step functions. It uses piso, a package built for set operations on pandas interval classes.
setup
Assume the same setup as in the staircase solution, that is, a DataFrame df with "start" and "end" timestamp columns alongside "color".
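The staircase answer's setup is not reproduced here, so the following is only a minimal sketch of one way to get there. It assumes the question's column names ("date", "start time", "duration(seconds)") and uses the first three rows of the question's table; the derived "start" and "end" columns are the ones the solution below relies on.

```python
import pandas as pd

# First three rows of the question's table (column names are assumed).
df = pd.DataFrame({
    "date": ["2021-07-06", "2021-07-06", "2021-07-06"],
    "color": ["RED", "RED", "BLUE"],
    "start time": ["11:00:00.00", "11:00:00.00", "11:00:00.00"],
    "duration(seconds)": [5, 9, 3],
})

# Combine date and time of day into a single timestamp.
df["start"] = pd.to_datetime(df["date"] + " " + df["start time"])

# The end of each display is its start plus the duration in seconds.
df["end"] = df["start"] + pd.to_timedelta(df["duration(seconds)"], unit="s")
```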
solution
For each colour, create a pandas.arrays.IntervalArray (or a pandas.IntervalIndex):
import pandas as pd
import piso
# one IntervalArray per colour, built from the start and end columns
interval_arrays = df.groupby("color").apply(lambda d: pd.arrays.IntervalArray.from_arrays(d["start"], d["end"]))
interval_arrays looks like this:
color
BLUE [(2021-07-06 11:00:01, 2021-07-06 11:00:08], (...
RED [(2021-07-06 11:00:00, 2021-07-06 11:00:05], (...
dtype: object
Next, take the union of the intervals within each array, so that overlapping intervals are merged and not double counted:
interval_arrays = interval_arrays.apply(piso.union)
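To illustrate what taking a union of overlapping intervals means, here is a small sketch in plain Python and pandas. It is not how piso implements union; merge_overlaps is a hypothetical helper and the three intervals are made-up sample values.

```python
import pandas as pd

def merge_overlaps(intervals):
    """Merge overlapping (start, end) pairs into disjoint ones."""
    merged = []
    for start, end in sorted(intervals, key=lambda pair: pair[0]):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

t0 = pd.Timestamp("2021-07-06 11:00:00")
sec = pd.Timedelta(seconds=1)

# Hypothetical sample: two overlapping intervals and one separate one.
sample = [(t0, t0 + 5 * sec), (t0 + 3 * sec, t0 + 9 * sec), (t0 + 12 * sec, t0 + 15 * sec)]

merged = merge_overlaps(sample)
total = sum((end - start for start, end in merged), pd.Timedelta(0))
```

Here the first two intervals collapse into one spanning 9 seconds, so the total is 12 seconds and not the 14 seconds a naive sum of durations would give.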
We then take the union across the per-colour arrays to get the total:
interval_arrays["TOTAL"] = piso.union(*interval_arrays)
interval_arrays now looks like this:
color
BLUE [(2021-07-06 11:00:01, 2021-07-06 11:00:08], (...
RED [(2021-07-06 11:00:00, 2021-07-06 11:00:05], (...
TOTAL [(2021-07-06 11:00:00, 2021-07-06 11:00:17]]
dtype: object
Create bins as a pandas.IntervalIndex:
bins = pd.date_range(pd.Timestamp("2021-07-06 11:00:00"), freq = "5s", periods=4)
ii_bins = pd.IntervalIndex.from_breaks(bins)
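The bins above cover only 15 seconds. If you want bins spanning the whole day, as the question suggests, one possible sketch is the following; the choice of hourly bins is an assumption made for illustration, and day_breaks and day_bins are hypothetical names.

```python
import pandas as pd

# 25 evenly spaced break points from midnight to midnight give 24 hourly bins.
day_breaks = pd.date_range(start="2021-07-06", end="2021-07-07", periods=25)
day_bins = pd.IntervalIndex.from_breaks(day_breaks)

# A single bin covering the whole day is also possible.
whole_day = pd.Interval(pd.Timestamp("2021-07-06"), pd.Timestamp("2021-07-07"))
```

Note that n break points always yield n - 1 bins, which is why periods=4 in the code above produces three 5-second bins.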
Then use piso.coverage(), which takes an interval array and a domain, and returns the fraction of the domain (here, a bin) covered by the intervals in the array. Multiplying that fraction by the bin length gives the total time covered within the bin.
interval_arrays.apply(lambda ia: pd.Series([piso.coverage(ia, b) * b.length for b in ii_bins]))
This results in a dataframe:
color 0 1 2
BLUE 0 days 00:00:04 0 days 00:00:05 0 days 00:00:04
RED 0 days 00:00:05 0 days 00:00:04 0 days 00:00:05
TOTAL 0 days 00:00:05 0 days 00:00:05 0 days 00:00:05
The columns are bin indices. You can replace them with the intervals in ii_bins if you want to, and perhaps melt the dataframe to get tidy data.
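As a rough sketch of that relabelling and melting step, the following rebuilds the result shown above by hand (values copied from the printed output) so that it runs without piso. The intervals are converted to strings for the column labels, which is a choice made here for simplicity; result and tidy are hypothetical names.

```python
import pandas as pd

# Bins as in the solution above.
bins = pd.date_range(pd.Timestamp("2021-07-06 11:00:00"), freq="5s", periods=4)
ii_bins = pd.IntervalIndex.from_breaks(bins)

# Seconds per colour and bin, copied from the output shown above.
result = pd.DataFrame(
    [[4, 5, 4], [5, 4, 5], [5, 5, 5]],
    index=pd.Index(["BLUE", "RED", "TOTAL"], name="color"),
).apply(pd.to_timedelta, unit="s")

# Replace the bin indices with readable interval labels.
result.columns = [str(interval) for interval in ii_bins]

# One row per (color, bin) pair.
tidy = result.reset_index().melt(id_vars="color", var_name="bin", value_name="duration")
```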
This approach also handles overlaps and handles intervals crossing bin boundaries.
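To see why an interval crossing bin boundaries is not a problem, here is a pandas-only sketch of the quantity being computed: the time within a bin covered by a set of disjoint intervals. It is an illustration, not piso's implementation; covered_time is a hypothetical helper and the single sample interval is made up.

```python
import pandas as pd

def covered_time(intervals, bin_interval):
    """Time inside bin_interval covered by disjoint intervals."""
    total = pd.Timedelta(0)
    for interval in intervals:
        # Clip the interval to the bin, then count what remains.
        lo = max(interval.left, bin_interval.left)
        hi = min(interval.right, bin_interval.right)
        if hi > lo:
            total += hi - lo
    return total

t0 = pd.Timestamp("2021-07-06 11:00:00")
ii_bins = pd.IntervalIndex.from_breaks(pd.date_range(t0, freq="5s", periods=4))

# Hypothetical interval from 11:00:03 to 11:00:12, spanning all three bins.
sample = pd.arrays.IntervalArray.from_arrays(
    [t0 + pd.Timedelta(seconds=3)], [t0 + pd.Timedelta(seconds=12)]
)

per_bin = [covered_time(sample, b) for b in ii_bins]
```

The 9-second interval contributes 2, 5 and 2 seconds to the three bins respectively, so nothing is lost or counted twice at the boundaries.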