Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging dataframes based on date range

I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) with a Date. I want to merge df1 and df2 based on df2.Date between df1.StartDate and df2.EndDate.

Each date range in df1 is unique and doesn't overlap with any of the other rows in the dataframe.

Dates are formatted YYYY-MM-DD.

like image 375
Nicholas Tulach Avatar asked Jul 09 '15 20:07

Nicholas Tulach


3 Answers

Just to provide an alternative way using np.piecewise. The performance is even faster than np.searchedsort.

import pandas as pd
import numpy as np

# data
# ====================================
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})

df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

df1.StartDate

Out[139]: 
0   2010-01-01
1   2010-01-06
2   2010-01-11
3   2010-01-16
4   2010-01-21
5   2010-01-26
6   2010-01-31
7   2010-02-05
8   2010-02-10
Name: StartDate, dtype: datetime64[ns]

df2.date_time

Out[140]: 
0    2010-01-01
1    2010-01-02
2    2010-01-03
3    2010-01-04
4    2010-01-05
5    2010-01-06
6    2010-01-07
7    2010-01-08
8    2010-01-09
9    2010-01-10
        ...    
40   2010-02-10
41   2010-02-11
42   2010-02-12
43   2010-02-13
44   2010-02-14
45   2010-02-15
46   2010-02-16
47   2010-02-17
48   2010-02-18
49   2010-02-19
Name: date_time, dtype: datetime64[ns]


df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)


Out[143]: 
    date_time  values  ID_matched
0  2010-01-01 -0.2240           1
1  2010-01-02 -0.4202           1
2  2010-01-03  0.9998           1
3  2010-01-04  0.4310           1
4  2010-01-05 -0.6509           0
5  2010-01-06 -1.4987           2
6  2010-01-07 -1.2306           2
7  2010-01-08  0.1940           2
8  2010-01-09 -0.9984           2
9  2010-01-10 -0.3676           0
..        ...     ...         ...
40 2010-02-10  0.5242           9
41 2010-02-11  0.3451           9
42 2010-02-12  0.7244           9
43 2010-02-13 -2.0404           9
44 2010-02-14 -1.0798           0
45 2010-02-15 -0.6934           0
46 2010-02-16 -2.3380           0
47 2010-02-17  1.6623           0
48 2010-02-18 -0.2754           0
49 2010-02-19 -0.7466           0

[50 rows x 3 columns]

%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 µs per loop
like image 146
Jianxun Li Avatar answered Nov 14 '22 18:11

Jianxun Li


Minor correction to @JianxunLi answer. Bit too involved for a comment.

This uses the len(funclist) == len(condlist) + 1 property of piecewise to assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems...

### Data / inits
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))

### Processing
valIfNoMatch = np.nan
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),\
                                     [(df2.date_time.values >= start_date)&(df2.date_time.values < end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],\
                                     np.append(df1.ID.values, valIfNoMatch))

PS. Also corrected the typo testing both >= & <=; a timestamp on an exact boundary between intervals would return true for two different intervals, which breaks a key assumption of the method.

like image 30
Mark_Anderson Avatar answered Nov 14 '22 16:11

Mark_Anderson


conditional_join from pyjanitor may be helpful in the abstraction/convenience; the function is currently in dev:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

Reusing jianxun-li's data:

np.random.seed(123)

df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 
                    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 
                    'ID': np.arange(1, 10, 1)})

df2 = pd.DataFrame(dict(values=np.random.randn(50),
                        date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
df2.conditional_join(
      df1, 
      ('date_time', 'StartDate', '>='), 
      ('date_time', 'EndDate', '<=')
   )

        left                 right
      values  date_time  StartDate    EndDate ID
0  -1.085631 2010-01-01 2010-01-01 2010-01-04  1
1   0.997345 2010-01-02 2010-01-01 2010-01-04  1
2   0.282978 2010-01-03 2010-01-01 2010-01-04  1
3  -1.506295 2010-01-04 2010-01-01 2010-01-04  1
4   1.651437 2010-01-06 2010-01-06 2010-01-09  2
5  -2.426679 2010-01-07 2010-01-06 2010-01-09  2
6  -0.428913 2010-01-08 2010-01-06 2010-01-09  2
7   1.265936 2010-01-09 2010-01-06 2010-01-09  2
8  -0.678886 2010-01-11 2010-01-11 2010-01-14  3
9  -0.094709 2010-01-12 2010-01-11 2010-01-14  3
10  1.491390 2010-01-13 2010-01-11 2010-01-14  3
11 -0.638902 2010-01-14 2010-01-11 2010-01-14  3
12 -0.434351 2010-01-16 2010-01-16 2010-01-19  4
13  2.205930 2010-01-17 2010-01-16 2010-01-19  4
14  2.186786 2010-01-18 2010-01-16 2010-01-19  4
15  1.004054 2010-01-19 2010-01-16 2010-01-19  4
16  0.737369 2010-01-21 2010-01-21 2010-01-24  5
17  1.490732 2010-01-22 2010-01-21 2010-01-24  5
18 -0.935834 2010-01-23 2010-01-21 2010-01-24  5
19  1.175829 2010-01-24 2010-01-21 2010-01-24  5
20 -0.637752 2010-01-26 2010-01-26 2010-01-29  6
21  0.907105 2010-01-27 2010-01-26 2010-01-29  6
22 -1.428681 2010-01-28 2010-01-26 2010-01-29  6
23 -0.140069 2010-01-29 2010-01-26 2010-01-29  6
24 -0.255619 2010-01-31 2010-01-31 2010-02-03  7
25 -2.798589 2010-02-01 2010-01-31 2010-02-03  7
26 -1.771533 2010-02-02 2010-01-31 2010-02-03  7
27 -0.699877 2010-02-03 2010-01-31 2010-02-03  7
28 -0.173636 2010-02-05 2010-02-05 2010-02-08  8
29  0.002846 2010-02-06 2010-02-05 2010-02-08  8
30  0.688223 2010-02-07 2010-02-05 2010-02-08  8
31 -0.879536 2010-02-08 2010-02-05 2010-02-08  8
32 -0.805367 2010-02-10 2010-02-10 2010-02-13  9
33 -1.727669 2010-02-11 2010-02-10 2010-02-13  9
34 -0.390900 2010-02-12 2010-02-10 2010-02-13  9
35  0.573806 2010-02-13 2010-02-10 2010-02-13  9

Under the hood it uses np.searchsorted (binary search).

Note that pd.IntervalIndex is a more efficient option, especially when the intervals are not overlapping.

like image 1
sammywemmy Avatar answered Nov 14 '22 17:11

sammywemmy