Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently joining two dataframes based on multiple levels of a multiindex

I frequently have a dataframe with a large multiindex, and a secondary DataFrame with a MultiIndex that is a subset of the larger one. The secondary dataframe is usually some kind of lookup table. I often want to add the columns from the lookup table to the larger dataframe. The primary DataFrame is often very large, so I want to do this efficiently.

Here is an imaginary example, where I construct two dataframes df1 and df2

import pandas as pd
import numpy as np

arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
          ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
          ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])
df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])

index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],
                                  names=['Body','Season'])

df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])

Giving the dataframes:

df1

                    A         B
Body Season Item                     
sun  summer one   -0.409372  0.638502
     winter one    1.448772 -1.460596
            two   -0.495634 -0.839063
moon summer one    1.296035 -1.439349
            two   -1.002667  0.508394
            three -1.247748 -0.645782
     winter one   -1.848857 -0.858759
            two    0.559172  2.202957

df2

                 Mood
Body Season          
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused

Now, suppose I want to add the columns from df2 to df1? This line is the only way I could find to do the job:

df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)

resulting in:

           A         B      Mood
Body Season Item
sun  summer one   -0.121588  0.272774      Good
     winter one    0.233562 -2.005623       Bad
            two   -1.034642  0.315065       Bad
moon summer one    0.184548  0.820873      Ugly
            two    0.838290  0.495047      Ugly
            three  0.450813 -2.040089      Ugly
     winter one   -1.149993 -0.498148  Confused
            two    2.406824 -2.031849  Confused

[8 rows x 3 columns]

It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index, then recreate the multiindex, makes this simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the dataframe is created. I am often working with very large dataframes, and this seems very inefficient.

Is there a better way to do this?

like image 572
Caleb Avatar asked Sep 01 '25 20:09

Caleb


1 Answers

join now allows merging of MultiIndex DataFrames with partially matching indices.

Following your example:

df1 = df1.join(df2, on=['Body','Season'])

make sure the on columns are specified in exactly the order that match the Index of the other DataFrame as the on argument matches the order you specify the labels of the calling DataFrame with the Index as it is in the other DataFrame you join to.

or just join without using on and by default it will use the common index levels between the two DataFrames:

df1 = df1.join(df2)

Resulting df1:

                          A         B      Mood
Body Season Item                               
sun  summer one   -0.483779  0.981052      Good
     winter one   -0.309939  0.803862       Bad
            two   -0.413732  0.025331       Bad
moon summer one   -0.926068 -1.316808      Ugly
            two    0.221627 -0.226154      Ugly
            three  1.064856  0.402827      Ugly
     winter one    0.526461 -0.932231  Confused
            two   -0.296415 -0.812374  Confused
like image 170
CAPSLOCK Avatar answered Sep 04 '25 02:09

CAPSLOCK