Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas merge with logic

Tags:

python

pandas

I would like to merge two dataframes, but can't exactly figure out how to do so without iterating. Basically, I want to merge a row from df2 to df1 if df1.date >= df2.start_date and df1.date <= df2.end_date. See example below:

df1:
index   date         value
0       2012-08-01   82
1       2012-08-02   20
2       2012-08-03   94
...
n-1     2012-10-29   58
n       2012-10-30   73

df2:
index   start_date   end_date     other_value
0       2012-08-01   2012-09-04   'foo'
1       2012-09-05   2012-10-15   'bar'
2       2012-10-16   2012-11-01   'foobar'
...


final_df:
index   df2_index   date         value  other_value
0       0           2012-08-01   82     'foo'
1       0           2012-08-02   20     'foo'
2       0           2012-08-03   94     'foo'
...
n-1     2           2012-10-29   58     'foobar'
n       2           2012-10-30   73     'foobar'

I thought about creating a date series vector to merge with df2, so that I can combine on date, but it seems very manual and does not leverage the power/speed of pandas. I also thought about trying to expand df2 into single days, but couldn't find any way to do so without a manual / iteration type solution.

like image 344
nscricco Avatar asked Aug 04 '14 18:08

nscricco


1 Answers

The naive iterative approach is O(n*m), where n = len(df1) and m = len(df2), since for each date in df1 you would have to check its inclusion in up to m intervals.

If the intervals defined by df2 are disjoint, then there is a theoretically better way: use searchsorted to find where each date in df1 fits amongst the start_dates, and then use searchsorted a second time to find where each date fits amongst the end_dates. When the index from the two calls to searchsorted are equal, the date falls inside an interval.

Searchsorted assumes the cutoff dates are sorted and it uses binary search, so each call has complexity O(n*log(m)).

If m is large enough, using searchsorted should be faster than the naive iterative approach.

If m is not large, the iterative approach may be faster.


Here is an example, using searchsorted:

import numpy as np
import pandas as pd
Timestamp = pd.Timestamp
df1 = pd.DataFrame({'date': (Timestamp('2012-08-01'),
                             Timestamp('2012-08-02'),
                             Timestamp('2012-08-03'),
                             Timestamp('2012-10-29'),
                             Timestamp('2012-10-30'),
                             Timestamp('2012-11-01'),
                             Timestamp('2012-10-15'),  # on then end_date
                             Timestamp('2012-09-04'),  # outside an interval
                             Timestamp('2012-09-05'),  # on then start_date
                             ),
                    'value': (82, 20, 94, 58, 73, 1, 2, 3, 4)})

print(df1)
df2 = pd.DataFrame({'end_date': (
                        Timestamp('2012-10-15'),
                        Timestamp('2012-09-04'),
                        Timestamp('2012-11-01')),
                    'other_value': ("foo", "bar", "foobar"),
                    'start_date': (
                        Timestamp('2012-09-05'),
                        Timestamp('2012-08-01'),
                        Timestamp('2012-10-16'))})
df2 = df2.reindex(columns=['start_date', 'end_date', 'other_value'])
df2.sort(['start_date'], inplace=True)
print(df2)

# Convert to DatetimeIndexes so we can call the searchsorted method
date_idx = pd.DatetimeIndex(df1['date'])
start_date_idx = pd.DatetimeIndex(df2['start_date'])
# Add one to the end_date so the original end_date will be included in the
# half-open interval.
end_date_idx = pd.DatetimeIndex(df2['end_date'])+pd.DateOffset(days=1)

start_idx = start_date_idx.searchsorted(date_idx, side='right')-1
end_idx = end_date_idx.searchsorted(date_idx, side='right')
df1['idx'] = np.where(start_idx == end_idx, end_idx, np.nan)

result = pd.merge(df1, df2, left_on=['idx'], right_index=True)
result = result.reindex(columns=['idx', 'date', 'value', 'other_value'])
print(result)

With df1 equal to

        date  value
0 2012-08-01     82
1 2012-08-02     20
2 2012-08-03     94
3 2012-10-29     58
4 2012-10-30     73
5 2012-11-01      1
6 2012-10-15      2
7 2012-09-04      3
8 2012-09-05      4

and df2 equal to

  start_date   end_date other_value
1 2012-08-01 2012-09-04         bar
0 2012-09-05 2012-10-15         foo
2 2012-10-16 2012-11-01      foobar

the above code yields

   idx       date  value other_value
0    0 2012-08-01     82         foo
1    0 2012-08-02     20         foo
2    0 2012-08-03     94         foo
7    0 2012-09-04      3         foo
3    2 2012-10-29     58      foobar
4    2 2012-10-30     73      foobar
5    2 2012-11-01      1      foobar
6    1 2012-10-15      2         bar
8    1 2012-09-05      4         bar
like image 143
unutbu Avatar answered Oct 13 '22 23:10

unutbu