Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to keep column MultiIndex values when merging pandas DataFrames

I have two pandas DataFrames, as below:

df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']

df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})

df1:

     Q1          Q2 ID
  SubQ1 SubQ2 SubQ1   
0     1     1     1  a
1     2     2     2  b
2     3     3     3  c

df2:

  item_id          url
0       a        a.com
1       b     blah.com
2       c  company.com

Note that df1 has some columns with hierarchical indexing (eg. ('Q1', 'SubQ1')) and some with just normal indexing (eg. ID).

I want to merge these two data frames on the ID and item_id fields. Using:

result = pd.merge(df1, df2, left_on='ID', right_on='item_id')

gives:

   (Q1, SubQ1)  (Q1, SubQ2)  (Q2, SubQ1) (ID, ) item_id          url
0            1            1            1      a       a        a.com
1            2            2            2      b       b     blah.com
2            3            3            3      c       c  company.com

As you can see, the merge itself works fine, but the MultiIndex has been lost and has reverted to tuples. I've tried to recreate the MultiIndex by using pd.MultiIndex.from_tuples, as in:

result.columns = pd.MultiIndex.from_tuples(result)

but this causes problems with the item_id and url columns, taking just the first two characters of their names:

     Q1          Q2 ID  i            u
  SubQ1 SubQ2 SubQ1     t            r
0     1     1     1  a  a        a.com
1     2     2     2  b  b     blah.com
2     3     3     3  c  c  company.com

Converting the columns in df2 to be one-element tuples (ie. ('item_id',) rather than just 'item_id') makes no difference.

How can I merge these two DataFrames and keep the MultiIndex properly? Or alternatively, how can I take the result of the merge and get back to columns with a proper MultiIndex without mucking up the names of the item_id and url columns?

like image 886
robintw Avatar asked Aug 21 '17 16:08

robintw


2 Answers

If you can't beat 'em, join 'em. (Make both DataFrames have the same number of index levels before merging):

import pandas as pd

df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']

df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})

df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
print(result)

yields

     Q1          Q2 ID item_id          url
  SubQ1 SubQ2 SubQ1                        
0     1     1     1  a       a        a.com
1     2     2     2  b       b     blah.com
2     3     3     3  c       c  company.com

This also avoids the UserWarning:

pandas/core/reshape/merge.py:551: UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)

like image 120
unutbu Avatar answered Oct 19 '22 16:10

unutbu


The column for ID is not "non-hierarchical". It is signified by ('ID', ). However, pandas allows you to reference just the first level of columns in a way that looks like you are referencing a single leveled column structure. Meaning this should work df1['ID'] as well as as df1[('ID',)] as well as df1.loc[:, ('ID',)]. But if it happened to be that the top level 'ID' had more columns associated with it in the second level, df1['ID'] would return a dataframe. I feel more comfortable with this solution, which looks a lot like @JohnGalt's answer in the comments.

df1.assign(u=df1[('ID', '')].map(df2.set_index('item_id').url))

     Q1          Q2 ID            u
  SubQ1 SubQ2 SubQ1                
0     1     1     1  a        a.com
1     2     2     2  b     blah.com
2     3     3     3  c  company.com

Joining on a single level column'd dataframe to a multi-level column'd dataframe is difficult. I have to artificially add another level.

def rnm(d):
    d = d.copy()
    d.columns = [d.columns, [''] * len(d.columns)]
    return d

df1.join(rnm(df2.set_index('item_id')), on=('ID',))

     Q1          Q2 ID          url
  SubQ1 SubQ2 SubQ1                
0     1     1     1  a        a.com
1     2     2     2  b     blah.com
2     3     3     3  c  company.com
like image 21
piRSquared Avatar answered Oct 19 '22 17:10

piRSquared