There are existing questions about finding the overlap between date or time ranges (for example). I've used these to solve my problem, but I've ended up with an extremely slow (and not-at-all elegant) solution. I would really appreciate ideas on how to make it faster (and more elegant):
The Problem:
I've got two dataframes, df1 and df2, each with two columns that represent a start time and an end time:
>>> df1
datetime_start datetime_end
0 2016-09-11 06:00:00 2016-09-11 06:30:00
1 2016-09-11 07:00:00 2016-09-11 07:30:00
2 2016-09-11 07:30:00 2016-09-11 08:00:00
3 2016-09-11 08:00:00 2016-09-11 08:30:00
4 2016-09-11 08:30:00 2016-09-11 09:00:00
5 2016-09-11 09:00:00 2016-09-11 09:30:00
6 2016-09-11 09:30:00 2016-09-11 10:00:00
7 2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00
>>> df2
datetime_start datetime_end catg
4 2016-09-11 08:48:33 2016-09-11 09:41:53 a
6 2016-09-11 09:54:25 2016-09-11 10:00:50 a
8 2016-09-11 10:01:47 2016-09-11 10:04:55 b
10 2016-09-11 10:08:00 2016-09-11 10:08:11 b
12 2016-09-11 10:30:28 2016-09-11 10:30:28 b
14 2016-09-11 10:38:18 2016-09-11 10:38:18 a
18 2016-09-11 13:44:05 2016-09-11 13:44:05 a
20 2016-09-11 13:46:52 2016-09-11 14:11:41 d
23 2016-09-11 14:22:17 2016-09-11 14:33:40 b
25 2016-09-11 15:00:12 2016-09-11 15:02:55 b
27 2016-09-11 15:04:19 2016-09-11 15:06:36 b
29 2016-09-11 15:08:43 2016-09-11 15:31:29 d
31 2016-09-11 15:38:04 2016-09-11 16:09:24 a
33 2016-09-11 16:18:40 2016-09-11 16:44:32 b
35 2016-09-11 16:45:59 2016-09-11 16:59:01 b
37 2016-09-11 17:08:31 2016-09-11 17:12:23 b
39 2016-09-11 17:16:13 2016-09-11 17:16:33 c
41 2016-09-11 17:17:23 2016-09-11 17:20:00 b
45 2016-09-13 12:27:59 2016-09-13 12:34:21 a
47 2016-09-13 12:38:39 2016-09-13 12:38:45 a
What I want is to find where the ranges in df2 overlap with the ranges in df1, how long each overlap is (in seconds), and which value of df2.catg each overlap belongs to. I want the length of that overlap inserted into a column in df1, named after the catg it represents.
Desired output:
>>> df1
datetime_start datetime_end a b d c
0 2016-09-11 06:00:00 2016-09-11 06:30:00 0.0 0.0 0.0 0.0
1 2016-09-11 07:00:00 2016-09-11 07:30:00 0.0 0.0 0.0 0.0
2 2016-09-11 07:30:00 2016-09-11 08:00:00 0.0 0.0 0.0 0.0
3 2016-09-11 08:00:00 2016-09-11 08:30:00 0.0 0.0 0.0 0.0
4 2016-09-11 08:30:00 2016-09-11 09:00:00 687.0 0.0 0.0 0.0
5 2016-09-11 09:00:00 2016-09-11 09:30:00 1800.0 0.0 0.0 0.0
6 2016-09-11 09:30:00 2016-09-11 10:00:00 1048.0 0.0 0.0 0.0
7 2016-09-11 10:30:00 2016-09-11 11:00:00 0.0 0.0 0.0 0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00 0.0 463.0 701.0 0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00 0.0 220.0 0.0 0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00 0.0 300.0 1277.0 0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00 1316.0 0.0 89.0 0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00 564.0 680.0 0.0 0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00 0.0 1654.0 0.0 0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00 0.0 389.0 0.0 20.0
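As a sanity check on one cell of the desired output: the a value of 687.0 in row 4 is the overlap of the bin 08:30:00-09:00:00 with the df2 event 08:48:33-09:41:53, i.e. the gap between the latest start and the earliest end (clamped at zero for disjoint ranges):

```python
import pandas as pd

# overlap of [08:30:00, 09:00:00) with [08:48:33, 09:41:53)
latest_start = max(pd.Timestamp('2016-09-11 08:30:00'),
                   pd.Timestamp('2016-09-11 08:48:33'))
earliest_end = min(pd.Timestamp('2016-09-11 09:00:00'),
                   pd.Timestamp('2016-09-11 09:41:53'))
overlap = max(0.0, (earliest_end - latest_start).total_seconds())
print(overlap)  # 687.0
```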
The ridiculously slow way to do this:
Based on this beautiful answer, I've achieved the goal using the following hard-to-follow code:
from collections import namedtuple

Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    return max(0, delta)

for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if overlap(row1, row2) > 0:
            df1.loc[idx1, row2.catg] += overlap(row1, row2)
This works, but is so slow on larger dataframes that it's effectively unusable. If anyone has any ideas to speed this up, I'd love your input.
Thanks in advance, and let me know if something is unclear!
dataframe setup:
import pandas as pd
from pandas import Timestamp
d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}
d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Based on timeit tests, with 100 executions each, the namedtuple approach in the question averaged 15.7314 seconds on my machine, vs. an average of 1.4794 seconds with this approach:
import numpy as np

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second
# in which an event occurred
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
    np.concatenate(list(map(np.arange, duration))) * pd.Timedelta('1s')

df1.merge(pd.DataFrame({'datetime_start': seconds_range,
                        'catg': np.repeat(df2.catg, duration)})
            .groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')])
            .size()
            .unstack(level=0)
            .reset_index(),
          how='left')
# datetime_end datetime_start a b c d
# 0 2016-09-11 06:30:00 2016-09-11 06:00:00 NaN NaN NaN NaN
# 1 2016-09-11 07:30:00 2016-09-11 07:00:00 NaN NaN NaN NaN
# 2 2016-09-11 08:00:00 2016-09-11 07:30:00 NaN NaN NaN NaN
# 3 2016-09-11 08:30:00 2016-09-11 08:00:00 NaN NaN NaN NaN
# 4 2016-09-11 09:00:00 2016-09-11 08:30:00 687.0 NaN NaN NaN
# 5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800.0 NaN NaN NaN
# 6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048.0 NaN NaN NaN
# 7 2016-09-11 11:00:00 2016-09-11 10:30:00 NaN NaN NaN NaN
# 8 2016-09-11 14:30:00 2016-09-11 14:00:00 NaN 463.0 NaN 701.0
# 9 2016-09-11 15:00:00 2016-09-11 14:30:00 NaN 220.0 NaN NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00 NaN 300.0 NaN 1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00 1316.0 NaN NaN 89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00 564.0 680.0 NaN NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00 NaN 1654.0 NaN NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00 NaN 389.0 20.0 NaN
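If you'd rather avoid materializing one row per second, the same totals can be computed by intersecting every df1 bin with every df2 event directly. This is only a sketch, not part of the answer above: the helper name overlap_per_catg is my own, it assumes df1 has an unnamed index (as in the setup) and pandas >= 1.2 (for how='cross'), and the cross join is fine here because both frames are small:

```python
import pandas as pd

def overlap_per_catg(df1, df2):
    # pair every df1 bin with every df2 event
    cross = df1.reset_index().merge(df2, how='cross', suffixes=('', '_ev'))
    # overlap in seconds: max(0, earliest end - latest start)
    latest_start = cross[['datetime_start', 'datetime_start_ev']].max(axis=1)
    earliest_end = cross[['datetime_end', 'datetime_end_ev']].min(axis=1)
    cross['secs'] = (earliest_end - latest_start).dt.total_seconds().clip(lower=0)
    # sum the overlap per (bin, catg) and put it back alongside df1
    per_catg = (cross.pivot_table(index='index', columns='catg',
                                  values='secs', aggfunc='sum')
                     .reindex(df1.index)
                     .fillna(0.0))
    per_catg.index.name = None
    per_catg.columns.name = None
    return df1.join(per_catg)
```

Calling overlap_per_catg(df1, df2) on the frames from the setup above reproduces the desired totals (e.g. 687.0 for category a in the 08:30-09:00 bin), and it computes exact interval intersections rather than counting whole seconds.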