I have two pandas dataframes: one (df1
) with three columns (StartDate
, EndDate
, and ID
) and a second (df2
) with a Date. I want to merge df1
and df2
based on df2.Date between df1.StartDate
and df2.EndDate
.
Each date range in df1
is unique and doesn't overlap with any of the other rows in the dataframe.
Dates are formatted YYYY-MM-DD
.
Just to provide an alternative way using np.piecewise
. The performance is even faster than np.searchedsort
.
import pandas as pd
import numpy as np
# data
# ====================================
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
df1.StartDate
Out[139]:
0 2010-01-01
1 2010-01-06
2 2010-01-11
3 2010-01-16
4 2010-01-21
5 2010-01-26
6 2010-01-31
7 2010-02-05
8 2010-02-10
Name: StartDate, dtype: datetime64[ns]
df2.date_time
Out[140]:
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
4 2010-01-05
5 2010-01-06
6 2010-01-07
7 2010-01-08
8 2010-01-09
9 2010-01-10
...
40 2010-02-10
41 2010-02-11
42 2010-02-12
43 2010-02-13
44 2010-02-14
45 2010-02-15
46 2010-02-16
47 2010-02-17
48 2010-02-18
49 2010-02-19
Name: date_time, dtype: datetime64[ns]
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
Out[143]:
date_time values ID_matched
0 2010-01-01 -0.2240 1
1 2010-01-02 -0.4202 1
2 2010-01-03 0.9998 1
3 2010-01-04 0.4310 1
4 2010-01-05 -0.6509 0
5 2010-01-06 -1.4987 2
6 2010-01-07 -1.2306 2
7 2010-01-08 0.1940 2
8 2010-01-09 -0.9984 2
9 2010-01-10 -0.3676 0
.. ... ... ...
40 2010-02-10 0.5242 9
41 2010-02-11 0.3451 9
42 2010-02-12 0.7244 9
43 2010-02-13 -2.0404 9
44 2010-02-14 -1.0798 0
45 2010-02-15 -0.6934 0
46 2010-02-16 -2.3380 0
47 2010-02-17 1.6623 0
48 2010-02-18 -0.2754 0
49 2010-02-19 -0.7466 0
[50 rows x 3 columns]
%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 µs per loop
Minor correction to @JianxunLi answer. Bit too involved for a comment.
This uses the len(funclist) == len(condlist) + 1
property of piecewise
to assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems...
### Data / inits
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
### Processing
valIfNoMatch = np.nan
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),\
[(df2.date_time.values >= start_date)&(df2.date_time.values < end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],\
np.append(df1.ID.values, valIfNoMatch))
PS. Also corrected the typo testing both >=
& <=
; a timestamp on an exact boundary between intervals would return true for two different intervals, which breaks a key assumption of the method.
conditional_join from pyjanitor may be helpful in the abstraction/convenience; the function is currently in dev:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
Reusing jianxun-li's data:
np.random.seed(123)
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50),
date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
df2.conditional_join(
df1,
('date_time', 'StartDate', '>='),
('date_time', 'EndDate', '<=')
)
left right
values date_time StartDate EndDate ID
0 -1.085631 2010-01-01 2010-01-01 2010-01-04 1
1 0.997345 2010-01-02 2010-01-01 2010-01-04 1
2 0.282978 2010-01-03 2010-01-01 2010-01-04 1
3 -1.506295 2010-01-04 2010-01-01 2010-01-04 1
4 1.651437 2010-01-06 2010-01-06 2010-01-09 2
5 -2.426679 2010-01-07 2010-01-06 2010-01-09 2
6 -0.428913 2010-01-08 2010-01-06 2010-01-09 2
7 1.265936 2010-01-09 2010-01-06 2010-01-09 2
8 -0.678886 2010-01-11 2010-01-11 2010-01-14 3
9 -0.094709 2010-01-12 2010-01-11 2010-01-14 3
10 1.491390 2010-01-13 2010-01-11 2010-01-14 3
11 -0.638902 2010-01-14 2010-01-11 2010-01-14 3
12 -0.434351 2010-01-16 2010-01-16 2010-01-19 4
13 2.205930 2010-01-17 2010-01-16 2010-01-19 4
14 2.186786 2010-01-18 2010-01-16 2010-01-19 4
15 1.004054 2010-01-19 2010-01-16 2010-01-19 4
16 0.737369 2010-01-21 2010-01-21 2010-01-24 5
17 1.490732 2010-01-22 2010-01-21 2010-01-24 5
18 -0.935834 2010-01-23 2010-01-21 2010-01-24 5
19 1.175829 2010-01-24 2010-01-21 2010-01-24 5
20 -0.637752 2010-01-26 2010-01-26 2010-01-29 6
21 0.907105 2010-01-27 2010-01-26 2010-01-29 6
22 -1.428681 2010-01-28 2010-01-26 2010-01-29 6
23 -0.140069 2010-01-29 2010-01-26 2010-01-29 6
24 -0.255619 2010-01-31 2010-01-31 2010-02-03 7
25 -2.798589 2010-02-01 2010-01-31 2010-02-03 7
26 -1.771533 2010-02-02 2010-01-31 2010-02-03 7
27 -0.699877 2010-02-03 2010-01-31 2010-02-03 7
28 -0.173636 2010-02-05 2010-02-05 2010-02-08 8
29 0.002846 2010-02-06 2010-02-05 2010-02-08 8
30 0.688223 2010-02-07 2010-02-05 2010-02-08 8
31 -0.879536 2010-02-08 2010-02-05 2010-02-08 8
32 -0.805367 2010-02-10 2010-02-10 2010-02-13 9
33 -1.727669 2010-02-11 2010-02-10 2010-02-13 9
34 -0.390900 2010-02-12 2010-02-10 2010-02-13 9
35 0.573806 2010-02-13 2010-02-10 2010-02-13 9
Under the hood it uses np.searchsorted
(binary search).
Note that pd.IntervalIndex
is a more efficient option, especially when the intervals are not overlapping.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With