There are existing questions about finding the overlap between date or time ranges (for example). I've used these to solve my problem, but I've ended up with an extremely slow (and not-at-all elegant) solution. I would really appreciate ideas on how to make it faster (and more elegant):
The Problem:
I've got two dataframes, df1 and df2, each with two columns that represent a start time and an end time:
>>> df1
datetime_start datetime_end
0 2016-09-11 06:00:00 2016-09-11 06:30:00
1 2016-09-11 07:00:00 2016-09-11 07:30:00
2 2016-09-11 07:30:00 2016-09-11 08:00:00
3 2016-09-11 08:00:00 2016-09-11 08:30:00
4 2016-09-11 08:30:00 2016-09-11 09:00:00
5 2016-09-11 09:00:00 2016-09-11 09:30:00
6 2016-09-11 09:30:00 2016-09-11 10:00:00
7 2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00
>>> df2
datetime_start datetime_end catg
4 2016-09-11 08:48:33 2016-09-11 09:41:53 a
6 2016-09-11 09:54:25 2016-09-11 10:00:50 a
8 2016-09-11 10:01:47 2016-09-11 10:04:55 b
10 2016-09-11 10:08:00 2016-09-11 10:08:11 b
12 2016-09-11 10:30:28 2016-09-11 10:30:28 b
14 2016-09-11 10:38:18 2016-09-11 10:38:18 a
18 2016-09-11 13:44:05 2016-09-11 13:44:05 a
20 2016-09-11 13:46:52 2016-09-11 14:11:41 d
23 2016-09-11 14:22:17 2016-09-11 14:33:40 b
25 2016-09-11 15:00:12 2016-09-11 15:02:55 b
27 2016-09-11 15:04:19 2016-09-11 15:06:36 b
29 2016-09-11 15:08:43 2016-09-11 15:31:29 d
31 2016-09-11 15:38:04 2016-09-11 16:09:24 a
33 2016-09-11 16:18:40 2016-09-11 16:44:32 b
35 2016-09-11 16:45:59 2016-09-11 16:59:01 b
37 2016-09-11 17:08:31 2016-09-11 17:12:23 b
39 2016-09-11 17:16:13 2016-09-11 17:16:33 c
41 2016-09-11 17:17:23 2016-09-11 17:20:00 b
45 2016-09-13 12:27:59 2016-09-13 12:34:21 a
47 2016-09-13 12:38:39 2016-09-13 12:38:45 a
What I want is to find where the ranges in df2 overlap with the ranges in df1, how long each overlap is (in seconds), and which value of df2.catg each overlap belongs to. I want the length of that overlap inserted into a column in df1, named after the catg it represents.
Desired output:
>>> df1
datetime_start datetime_end a b d c
0 2016-09-11 06:00:00 2016-09-11 06:30:00 0.0 0.0 0.0 0.0
1 2016-09-11 07:00:00 2016-09-11 07:30:00 0.0 0.0 0.0 0.0
2 2016-09-11 07:30:00 2016-09-11 08:00:00 0.0 0.0 0.0 0.0
3 2016-09-11 08:00:00 2016-09-11 08:30:00 0.0 0.0 0.0 0.0
4 2016-09-11 08:30:00 2016-09-11 09:00:00 687.0 0.0 0.0 0.0
5 2016-09-11 09:00:00 2016-09-11 09:30:00 1800.0 0.0 0.0 0.0
6 2016-09-11 09:30:00 2016-09-11 10:00:00 1048.0 0.0 0.0 0.0
7 2016-09-11 10:30:00 2016-09-11 11:00:00 0.0 0.0 0.0 0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00 0.0 463.0 701.0 0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00 0.0 220.0 0.0 0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00 0.0 300.0 1277.0 0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00 1316.0 0.0 89.0 0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00 564.0 680.0 0.0 0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00 0.0 1654.0 0.0 0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00 0.0 389.0 0.0 20.0
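As a sanity check on one cell of the desired output: the a value of 687.0 in row 4 is the overlap of the bin 08:30:00-09:00:00 with the df2 event 08:48:33-09:41:53, i.e. the gap between the latest start and the earliest end (clamped at zero for disjoint ranges):

```python
import pandas as pd

# overlap of [08:30:00, 09:00:00) with [08:48:33, 09:41:53)
latest_start = max(pd.Timestamp('2016-09-11 08:30:00'),
                   pd.Timestamp('2016-09-11 08:48:33'))
earliest_end = min(pd.Timestamp('2016-09-11 09:00:00'),
                   pd.Timestamp('2016-09-11 09:41:53'))
overlap = max(0.0, (earliest_end - latest_start).total_seconds())
print(overlap)  # 687.0
```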
The ridiculously slow way to do this:
Based on this beautiful answer, I've achieved the goal using the following hard-to-follow code:
from collections import namedtuple

Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    return max(0, delta)

for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if overlap(row1, row2) > 0:
            df1.loc[idx1, row2.catg] += overlap(row1, row2)
This works, but is so slow on larger dataframes that it's effectively unusable. If anyone has any ideas to speed this up, I'd love your input.
Thanks in advance, and let me know if something is unclear!
dataframe setup:
import pandas as pd
from pandas import Timestamp
d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}
d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Based on timeit tests, with 100 executions each, the namedtuple approach in the question averaged 15.7314 seconds on my machine, vs. an average of 1.4794 seconds with this approach:
import numpy as np

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second
# in which an event occurred
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
    np.concatenate(list(map(np.arange, duration))) * pd.Timedelta('1s')

df1.merge(pd.DataFrame({'datetime_start': seconds_range,
                        'catg': np.repeat(df2.catg, duration)})
            .groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')])
            .size()
            .unstack(level=0)
            .reset_index(),
          how='left')
# datetime_end datetime_start a b c d
# 0 2016-09-11 06:30:00 2016-09-11 06:00:00 NaN NaN NaN NaN
# 1 2016-09-11 07:30:00 2016-09-11 07:00:00 NaN NaN NaN NaN
# 2 2016-09-11 08:00:00 2016-09-11 07:30:00 NaN NaN NaN NaN
# 3 2016-09-11 08:30:00 2016-09-11 08:00:00 NaN NaN NaN NaN
# 4 2016-09-11 09:00:00 2016-09-11 08:30:00 687.0 NaN NaN NaN
# 5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800.0 NaN NaN NaN
# 6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048.0 NaN NaN NaN
# 7 2016-09-11 11:00:00 2016-09-11 10:30:00 NaN NaN NaN NaN
# 8 2016-09-11 14:30:00 2016-09-11 14:00:00 NaN 463.0 NaN 701.0
# 9 2016-09-11 15:00:00 2016-09-11 14:30:00 NaN 220.0 NaN NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00 NaN 300.0 NaN 1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00 1316.0 NaN NaN 89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00 564.0 680.0 NaN NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00 NaN 1654.0 NaN NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00 NaN 389.0 20.0 NaN
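If you'd rather avoid materializing one row per second, the same totals can be computed by intersecting every df1 bin with every df2 event directly. This is only a sketch, not part of the answer above: the helper name overlap_per_catg is my own, it assumes df1 has an unnamed index (as in the setup) and pandas >= 1.2 (for how='cross'), and the cross join is fine here because both frames are small:

```python
import pandas as pd

def overlap_per_catg(df1, df2):
    # pair every df1 bin with every df2 event
    cross = df1.reset_index().merge(df2, how='cross', suffixes=('', '_ev'))
    # overlap in seconds: max(0, earliest end - latest start)
    latest_start = cross[['datetime_start', 'datetime_start_ev']].max(axis=1)
    earliest_end = cross[['datetime_end', 'datetime_end_ev']].min(axis=1)
    cross['secs'] = (earliest_end - latest_start).dt.total_seconds().clip(lower=0)
    # sum the overlap per (bin, catg) and put it back alongside df1
    per_catg = (cross.pivot_table(index='index', columns='catg',
                                  values='secs', aggfunc='sum')
                     .reindex(df1.index)
                     .fillna(0.0))
    per_catg.index.name = None
    per_catg.columns.name = None
    return df1.join(per_catg)
```

Calling overlap_per_catg(df1, df2) on the frames from the setup above reproduces the desired totals (e.g. 687.0 for category a in the 08:30-09:00 bin), and it computes exact interval intersections rather than counting whole seconds.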