I have two pandas dataframes: dfLeft and dfRight with the date as the index.
dfLeft:
cusip factorL
date
2012-01-03 XXXX 4.5
2012-01-03 YYYY 6.2
....
2012-01-04 XXXX 4.7
2012-01-04 YYYY 6.1
....
dfRight:
idc__id factorR
date
2012-01-03 XXXX 5.0
2012-01-03 YYYY 6.0
....
2012-01-04 XXXX 5.1
2012-01-04 YYYY 6.2
Both have a shape close to (121900,3)
I tried the following merge:
test = pd.merge(dfLeft, dfRight, left_index=True, right_index=True, left_on='cusip', right_on='idc__id', how = 'inner')
This gave test a shape of (60643500, 6)
.
Any recommendations on what is going wrong here? I want it to merge based on both date and cusip/idc_id. Note: for this example the cusips are lined up, but in reality that may not be so.
Thanks.
Expected Output test:
cusip factorL factorR
date
2012-01-03 XXXX 4.5 5.0
2012-01-03 YYYY 6.2 6.0
....
2012-01-04 XXXX 4.7 5.1
2012-01-04 YYYY 6.1 6.2
pd. concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default.
Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.
The merge() function is used to merge DataFrame or named Series objects with a database-style join. The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Reset the indices and then merge on multiple (column-)keys:
dfLeft.reset_index(inplace=True)
dfRight.reset_index(inplace=True)
dfMerged = pd.merge(dfLeft, dfRight,
left_on=['date', 'cusip'],
right_on=['date', 'idc__id'],
how='inner')
You can then reset 'date' as an index:
dfMerged.set_index('date', inplace=True)
Here's an example:
raw1 = '''
2012-01-03 XXXX 4.5
2012-01-03 YYYY 6.2
2012-01-04 XXXX 4.7
2012-01-04 YYYY 6.1
'''
raw2 = '''
2012-01-03 XYXX 45.
2012-01-03 YYYY 62.
2012-01-04 XXXX -47.
2012-01-05 YYYY 61.
'''
import pandas as pd
from StringIO import StringIO
df1 = pd.read_table(StringIO(raw1), header=None,
delim_whitespace=True, parse_dates=[0], skiprows=1)
df2 = pd.read_table(StringIO(raw2), header=None,
delim_whitespace=True, parse_dates=[0], skiprows=1)
df1.columns = ['date', 'cusip', 'factorL']
df2.columns = ['date', 'idc__id', 'factorL']
print pd.merge(df1, df2,
left_on=['date', 'cusip'],
right_on=['date', 'idc__id'],
how='inner')
which gives
date cusip factorL_x idc__id factorL_y
0 2012-01-03 00:00:00 YYYY 6.2 YYYY 62
1 2012-01-04 00:00:00 XXXX 4.7 XXXX -47
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With