
Merge on single level of MultiIndex

Tags:

python

pandas

Is there any way to merge on a single level of a MultiIndex without resetting the index?

I have a "static" table of time-invariant values, indexed by an ObjectID, and I have a "dynamic" table of time-varying fields, indexed by ObjectID+Date. I'd like to join these tables together.

Right now, the best I can think of is:

dynamic.reset_index().merge(static, left_on=['ObjectID'], right_index=True) 

However, the dynamic table is very big, and I don't want to have to muck around with its index in order to combine the values.

Johann Hibschman asked May 20 '13 13:05




2 Answers

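Yes. Since pandas 0.14.0, it is possible to merge a singly-indexed DataFrame with one level of a multi-indexed DataFrame using .join. The level is matched by name: the index name of the singly-indexed frame must be the same as one of the level names of the multi-indexed frame.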

df1.join(df2, how='inner') # how='outer' keeps all records from both data frames 

The pandas 0.14 docs describe this as equivalent to, but faster and more memory-efficient than:

pd.merge(df1.reset_index(),
         df2.reset_index(),
         on=['index1'],
         how='inner'
        ).set_index(['index1', 'index2'])
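As a minimal sketch of the single-level join described above, the example below uses small tables shaped like the ones in the question. The column names Sector and Price and the sample values are made up for illustration; the one requirement it relies on is that the index name of the singly-indexed frame (ObjectID) matches a level name of the MultiIndex.

```python
import pandas as pd

# "Static" table: one row per object, indexed by ObjectID (illustrative data).
static = pd.DataFrame(
    {"Sector": ["tech", "energy"]},
    index=pd.Index(["A", "B"], name="ObjectID"),
)

# "Dynamic" table: one row per object and date, indexed by ObjectID + Date.
dynamic = pd.DataFrame(
    {"Price": [1.0, 1.5, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", "2013-05-01"), ("A", "2013-05-02"), ("B", "2013-05-01")],
        names=["ObjectID", "Date"],
    ),
)

# Join on the shared ObjectID level without resetting the index of `dynamic`.
joined = dynamic.join(static, how="inner")
print(joined)
```

Each row of the dynamic table picks up the static value for its ObjectID, and the result still carries both index levels.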

The docs also mention that .join cannot be used to merge two multi-indexed DataFrames on a single level, and the GitHub tracker discussion for the earlier issue suggests that implementing this was not a priority:

so I merged in the single join, see #6363; along with some docs on how to do a multi-multi join. That's fairly complicated to actually implement. and IMHO not worth the effort as it really doesn't change the memory usage/speed that much at all.

However, there is a GitHub conversation about this in which there has been some more recent development: https://github.com/pydata/pandas/issues/6360. It is also possible to achieve this by resetting the indices, as mentioned earlier and as described in the docs.
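For two multi-indexed frames that share one level, the reset-and-merge route might look like the sketch below. The frames, the shared level name key, and the other level names X and Y are illustrative assumptions.

```python
import pandas as pd

# Two frames whose MultiIndexes share only the "key" level (illustrative data).
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"]},
    index=pd.MultiIndex.from_tuples(
        [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
    ),
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2"]},
    index=pd.MultiIndex.from_tuples(
        [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2")], names=["key", "Y"]
    ),
)

# Move the index levels into columns, merge on the shared level,
# then rebuild a MultiIndex from all three levels.
merged = (
    left.reset_index()
    .merge(right.reset_index(), on="key", how="inner")
    .set_index(["key", "X", "Y"])
)
print(merged)
```

This copies both frames while the indices are reset, which is the memory cost the question was trying to avoid, so it is mainly useful when the frames are small or no level-aware join is available.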


Update for pandas >= 0.24.0

It is now possible to merge multiindexed data frames with each other. As per the release notes:

import pandas as pd

index_left = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
                                        ('K1', 'X2')],
                                       names=['key', 'X'])

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']}, index=index_left)

index_right = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
                                         ('K2', 'Y2'), ('K2', 'Y3')],
                                        names=['key', 'Y'])

right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']}, index=index_right)

left.join(right)

Out:

            A   B   C   D
key X  Y
K0  X0 Y0  A0  B0  C0  D0
    X1 Y0  A1  B1  C0  D0
K1  X2 Y1  A2  B2  C1  D1

[3 rows x 4 columns]
joelostblom answered Oct 03 '22 23:10


I get around this by reindexing the DataFrame being merged in so that it has the full MultiIndex, which makes a left join possible.

# Create the left data frame
import pandas as pd

# Note: the `labels` argument was renamed to `codes` in pandas 0.24
# and `labels` was removed in pandas 1.0.
idx = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd']],
                    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
                    names=['lvl1', 'lvl2'])
df = pd.DataFrame([1, 2, 3, 4], index=idx, columns=['data'])

# Create the factor to join to the left data frame
newFactor = pd.DataFrame(['fact:' + str(x) for x in df.index.levels[0]],
                         index=df.index.levels[0],
                         columns=['newFactor'])

Then do the join on the sub-index by reindexing the newFactor DataFrame so that it carries the full index of the left data frame:

df.join(newFactor.reindex(df.index,level=0)) 
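Put together, the reindexing approach can be sketched as a single runnable script. This version builds the index with MultiIndex.from_product instead of the levels/codes constructor, which is a substitution made here so the sketch does not depend on the pandas version; the variable names new_factor, expanded and result are likewise only illustrative.

```python
import pandas as pd

# Left frame with a two-level index (same shape as the example above).
idx = pd.MultiIndex.from_product([["a", "b"], ["c", "d"]], names=["lvl1", "lvl2"])
df = pd.DataFrame({"data": [1, 2, 3, 4]}, index=idx)

# One value per first-level key, held in a singly-indexed frame.
new_factor = pd.DataFrame(
    {"newFactor": ["fact:a", "fact:b"]},
    index=pd.Index(["a", "b"], name="lvl1"),
)

# Broadcast new_factor across level 0 of df's index, then join on the
# now-identical MultiIndex.
expanded = new_factor.reindex(df.index, level=0)
result = df.join(expanded)
print(result)
```

The reindex step repeats each factor value once per matching row of df, so the subsequent join is an ordinary index-to-index left join.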
closedloop answered Oct 04 '22 01:10