I have two dataframes:
df_A = pd.DataFrame({
    'start_date': ['2021-04-01 00:00:00', '2021-05-01 00:00:00', '2021-05-02 00:00:00'],
    'end_date': ['2021-04-01 00:11:00', '2021-05-01 00:06:00', '2021-05-03 00:00:00'],
    'eventname': ['birthday', 'wedding', 'birthday'],
})
df_B = pd.DataFrame({
    'event_date': ['2021-04-01 00:06:00', '2021-05-01 00:03:00', '2021-05-04 00:00:00'],
    'price': [100, 200, 500],
})
df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)
df_A
start_date end_date eventname
0 2021-04-01 2021-04-01 00:11:00 birthday
1 2021-05-01 2021-05-01 00:06:00 wedding
2 2021-05-02 2021-05-03 00:00:00 birthday
df_B
event_date price
0 2021-04-01 00:06:00 100
1 2021-05-01 00:03:00 200
2 2021-05-04 00:00:00 500
I want to join them such that the event_date column of df_B lies between the start_date and end_date columns of df_A. This is my code:
df_A.index = pd.IntervalIndex.from_arrays(df_A['start_date'],df_A['end_date'],closed='both')
df_B['start_date'] = df_B['event_date'].apply(lambda x : df_A.iloc[df_A.index.get_loc(x)]['start_date'])
df_B
However, I get a KeyError because some of the values in the event_date column of df_B do not lie between any start_date/end_date pair in df_A.
This is my expected output:
event_date price start_date
0 2021-04-01 00:06:00 100 2021-04-01
1 2021-05-01 00:03:00 200 2021-05-01
2 2021-05-04 00:00:00 500 NaN
I tried using intersection, but that only works when the values are exactly equal. How do I do this?
If your [start_date, end_date] intervals do not overlap, create an interval index and merge your two dataframes:
bins = pd.IntervalIndex.from_arrays(df_A['start_date'],
                                    df_A['end_date'],
                                    closed='both')
out = df_B.assign(interval=pd.cut(df_B['event_date'], bins)) \
          .merge(df_A.assign(interval=bins), on='interval', how='left')
print(out[['event_date', 'price', 'start_date']])
# Output:
event_date price start_date
0 2021-04-01 00:06:00 100 2021-04-01
1 2021-05-01 00:03:00 200 2021-05-01
2 2021-05-04 00:00:00 500 NaT
If you can avoid a non-equi join, do so: a bucketing technique, similar to @Corralien's excellent answer, where the bins do not overlap, is faster and more memory-efficient. Another option, where possible, is to restructure the problem into regular merge joins and union the results; the steps are longer, but done right it can still beat a non-equi join. If you only care about the first match, pd.merge_asof should suffice.
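For the first-match case, a pd.merge_asof sketch on the sample frames above (the post-filter that clears matches past end_date is my addition, since merge_asof only checks one side of the range):

```python
import pandas as pd

df_A = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-04-01 00:00:00', '2021-05-01 00:00:00', '2021-05-02 00:00:00']),
    'end_date': pd.to_datetime(['2021-04-01 00:11:00', '2021-05-01 00:06:00', '2021-05-03 00:00:00']),
    'eventname': ['birthday', 'wedding', 'birthday'],
})
df_B = pd.DataFrame({
    'event_date': pd.to_datetime(['2021-04-01 00:06:00', '2021-05-01 00:03:00', '2021-05-04 00:00:00']),
    'price': [100, 200, 500],
})

# For each event_date, take the latest start_date at or before it ...
out = pd.merge_asof(
    df_B.sort_values('event_date'),
    df_A.sort_values('start_date'),
    left_on='event_date',
    right_on='start_date',
    direction='backward',
)
# ... then invalidate matches where event_date overshoots the interval's end
expired = out['event_date'] > out['end_date']
out.loc[expired, ['start_date', 'end_date']] = pd.NaT

print(out[['event_date', 'price', 'start_date']])
```

The third row (2021-05-04) matches the 2021-05-02 interval by start but falls after its end_date, so the filter turns it into NaT, matching the expected output.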
Back to the question:
One option is the conditional_join function from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor

(df_A
 .conditional_join(
     df_B,
     # (column from left frame, column from right frame, comparison operator)
     ('start_date', 'event_date', '<='),
     ('end_date', 'event_date', '>='),
     how='right')
 .filter(['event_date', 'price', 'start_date'])
)
event_date price start_date
0 2021-04-01 00:06:00 100 2021-04-01
1 2021-05-01 00:03:00 200 2021-05-01
2 2021-05-04 00:00:00 500 NaT
What it does is compare columns from df_A with df_B:
('start_date', 'event_date', '<=') checks whether start_date from df_A is <= event_date from df_B; the same idea applies to the other condition. The conditions are passed as variable arguments, so you can supply as many as you need. The paired columns must have the same type (float vs float, integer vs integer, datetime vs datetime, ...).
Under the hood, it uses binary search for non-equi joins (primarily to avoid a cartesian join, which can be memory-inefficient as the dataframe sizes grow); for range joins such as this one, it uses a simple search to narrow the search space and get the matching rows.
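For contrast, the cartesian-product approach that conditional_join avoids can be sketched in plain pandas; this is my own illustration, workable for small frames, but memory grows as len(df_A) * len(df_B):

```python
import pandas as pd

df_A = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-04-01 00:00:00', '2021-05-01 00:00:00', '2021-05-02 00:00:00']),
    'end_date': pd.to_datetime(['2021-04-01 00:11:00', '2021-05-01 00:06:00', '2021-05-03 00:00:00']),
})
df_B = pd.DataFrame({
    'event_date': pd.to_datetime(['2021-04-01 00:06:00', '2021-05-01 00:03:00', '2021-05-04 00:00:00']),
    'price': [100, 200, 500],
})

# Pair every row of df_B with every row of df_A, then keep in-range pairs
pairs = df_B.merge(df_A, how='cross')
matched = pairs[pairs['event_date'].between(pairs['start_date'], pairs['end_date'])]

# Re-attach unmatched df_B rows to recover left-join semantics (NaT start_date)
out = df_B.merge(matched[['event_date', 'start_date']], on='event_date', how='left')
print(out)
```

Series.between is inclusive on both ends by default, matching closed='both' above; how='cross' requires pandas >= 1.2.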
The function can handle overlapping intervals, as well as range joins that involve two different columns on each side, e.g. a < b and c > d.
You can have a look at more examples in the pyjanitor documentation.