I would like to merge two dataframes, but can't exactly figure out how to do so without iterating. Basically, I want to merge a row from df2 to df1 if df1.date >= df2.start_date and df1.date <= df2.end_date. See example below:
df1:
index date value
0 2012-08-01 82
1 2012-08-02 20
2 2012-08-03 94
...
n-1 2012-10-29 58
n 2012-10-30 73
df2:
index start_date end_date other_value
0 2012-08-01 2012-09-04 'foo'
1 2012-09-05 2012-10-15 'bar'
2 2012-10-16 2012-11-01 'foobar'
...
final_df:
index df2_index date value other_value
0 0 2012-08-01 82 'foo'
1 0 2012-08-02 20 'foo'
2 0 2012-08-03 94 'foo'
...
n-1 2 2012-10-29 58 'foobar'
n 2 2012-10-30 73 'foobar'
I thought about creating a date series vector to merge with df2, so that I can combine on date, but it seems very manual and does not leverage the power/speed of pandas. I also thought about trying to expand df2 into single days, but couldn't find any way to do so without a manual / iteration type solution.
The naive iterative approach is O(n*m)
, where n = len(df1)
and m = len(df2)
, since for each date in df1
you would have to check its inclusion in up to m
intervals.
If the intervals defined by df2
are disjoint, then there is a theoretically better way: use searchsorted to find where each date in df1
fits amongst the start_dates, and then use searchsorted
a second time to find where each date fits amongst the end_dates. When the index from the two calls to searchsorted
are equal, the date falls inside an interval.
Searchsorted assumes the cutoff dates are sorted and it uses binary search, so each call has complexity O(n*log(m)).
If m
is large enough, using searchsorted
should be faster
than the naive iterative approach.
If m
is not large, the iterative approach may be faster.
Here is an example, using searchsorted
:
import numpy as np
import pandas as pd
Timestamp = pd.Timestamp
df1 = pd.DataFrame({'date': (Timestamp('2012-08-01'),
Timestamp('2012-08-02'),
Timestamp('2012-08-03'),
Timestamp('2012-10-29'),
Timestamp('2012-10-30'),
Timestamp('2012-11-01'),
Timestamp('2012-10-15'), # on then end_date
Timestamp('2012-09-04'), # outside an interval
Timestamp('2012-09-05'), # on then start_date
),
'value': (82, 20, 94, 58, 73, 1, 2, 3, 4)})
print(df1)
df2 = pd.DataFrame({'end_date': (
Timestamp('2012-10-15'),
Timestamp('2012-09-04'),
Timestamp('2012-11-01')),
'other_value': ("foo", "bar", "foobar"),
'start_date': (
Timestamp('2012-09-05'),
Timestamp('2012-08-01'),
Timestamp('2012-10-16'))})
df2 = df2.reindex(columns=['start_date', 'end_date', 'other_value'])
df2.sort(['start_date'], inplace=True)
print(df2)
# Convert to DatetimeIndexes so we can call the searchsorted method
date_idx = pd.DatetimeIndex(df1['date'])
start_date_idx = pd.DatetimeIndex(df2['start_date'])
# Add one to the end_date so the original end_date will be included in the
# half-open interval.
end_date_idx = pd.DatetimeIndex(df2['end_date'])+pd.DateOffset(days=1)
start_idx = start_date_idx.searchsorted(date_idx, side='right')-1
end_idx = end_date_idx.searchsorted(date_idx, side='right')
df1['idx'] = np.where(start_idx == end_idx, end_idx, np.nan)
result = pd.merge(df1, df2, left_on=['idx'], right_index=True)
result = result.reindex(columns=['idx', 'date', 'value', 'other_value'])
print(result)
With df1
equal to
date value
0 2012-08-01 82
1 2012-08-02 20
2 2012-08-03 94
3 2012-10-29 58
4 2012-10-30 73
5 2012-11-01 1
6 2012-10-15 2
7 2012-09-04 3
8 2012-09-05 4
and df2
equal to
start_date end_date other_value
1 2012-08-01 2012-09-04 bar
0 2012-09-05 2012-10-15 foo
2 2012-10-16 2012-11-01 foobar
the above code yields
idx date value other_value
0 0 2012-08-01 82 foo
1 0 2012-08-02 20 foo
2 0 2012-08-03 94 foo
7 0 2012-09-04 3 foo
3 2 2012-10-29 58 foobar
4 2 2012-10-30 73 foobar
5 2 2012-11-01 1 foobar
6 1 2012-10-15 2 bar
8 1 2012-09-05 4 bar
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With