I have 3 DataFrames
that have differing numbers of shared indexes. For example:
>>> df0=pd.DataFrame(index=pd.MultiIndex.from_product([[1,2,3,4],[2011,2012],['A','B']], names=['Season','Year','Location']))
>>> df0['Value0']=np.random.randint(1,100,len(df0))
>>>
>>> df1=pd.DataFrame(index=pd.MultiIndex.from_product([[2011,2012],['A','B']], names=['Year','Location']))
>>> df1['Value1']=np.random.randint(1,100,len(df1))
>>>
>>> df2=pd.DataFrame(index=['A','B'])
>>> df2.index.name='Location'
>>> df2['Value2']=np.random.randint(1,100,len(df2))
>>> df0
Value0
Season Year Location
1 2011 A 18
B 63
2012 A 88
B 30
2 2011 A 35
B 60
2012 A 61
B 4
3 2011 A 70
B 9
2012 A 11
B 38
4 2011 A 68
B 57
2012 A 13
B 35
>>> df1
Value1
Year Location
2011 A 22
B 74
2012 A 73
B 44
>>> df2
Value2
Location
A 70
B 85
>>>
I am looking for the best way to join them on their common indexes.
Things I have tried:
1) pd.concat([df0,df1,df2],1)
would be nice because it accepts a list of dataframes, but this appears to only work if the dataframes have the same number of indexes.
2) Joining either of the multiindex dataframes with the single index DataFrame
works: df1.join(df2)
or df0.join(df2)
. However joining the DataFrame
with two indexes with the DataFrame
with three indexes does not: df0.join(df1)
and gives me the following error: "NotImplementedError: merging with more than one level overlap on a multi-index is not implemented"
At this point, the approach I have been taking is to reset the indexes and use pd.merge()
. See below:
def JoinMulti(DFList):
FinalDF=DFList[0].reset_index()
for OtherDF in DFList[1:]:
FinalDF=pd.merge(FinalDF, OtherDF.reset_index(), how='outer')
#Now I want reindex it so that it's indexed the same as the `DataFrame` with the highest number of levels
NLevels=[x.index.nlevels for x in DFList]
MaxIndexPos=NLevels.index(max(NLevels))
FinalIndex=DFList[MaxIndexPos].index
FinalDF=FinalDF.set_index(FinalIndex.names).reindex(FinalIndex)
return FinalDF
>>> JoinMulti([df0,df1,df2])
Value0 Value1 Value2
Season Year Location
1 2011 A 43 5 96
B 63 46 97
2012 A 68 6 96
B 23 99 97
2 2011 A 66 5 96
B 30 46 97
2012 A 45 6 96
B 79 99 97
3 2011 A 66 5 96
B 21 46 97
2012 A 86 6 96
B 11 99 97
4 2011 A 95 5 96
B 58 46 97
2012 A 32 6 96
B 80 99 97
>>>
Is this a reasonable approach? Are there any improvements that could be made or any exceptions that I'm forgetting about?
You can also use DataFrame. join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.
pd. concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default.
You should be able to use join , which joins on the index as default. Given your desired result, you must use outer as the join type. Signature: _. join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) Docstring: Join columns with other DataFrame either on index or on a key column.
The merge() function is used to merge DataFrame or named Series objects with a database-style join. The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
I modified solution by Stefan Jansen:
def jez(df0,df1,df2):
df1 = df1.join(df2)
df0 = df0.reset_index('Season')
FinalDF = df0.join(df1).set_index('Season', append=True).reorder_levels(['Season', 'Year', 'Location']).sortlevel()
return FinalDF
print jez(df0,df1,df2)
Timing:
In [41]: %timeit jez(df0,df1,df2)
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 5.02 ms per loop
In [42]: %timeit JoinMulti([df0,df1,df2])
100 loops, best of 3: 9.83 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With