I have two pandas DataFrames, as below:
df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']
df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})
df1
:
Q1 Q2 ID
SubQ1 SubQ2 SubQ1
0 1 1 1 a
1 2 2 2 b
2 3 3 3 c
df2
:
item_id url
0 a a.com
1 b blah.com
2 c company.com
Note that df1
has some columns with hierarchical indexing (eg. ('Q1', 'SubQ1')
) and some with just normal indexing (eg. ID
).
I want to merge these two data frames on the ID
and item_id
fields. Using:
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
gives:
(Q1, SubQ1) (Q1, SubQ2) (Q2, SubQ1) (ID, ) item_id url
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
As you can see, the merge itself works fine, but the MultiIndex has been lost and has reverted to tuples. I've tried to recreate the MultiIndex by using pd.MultiIndex.from_tuples
, as in:
result.columns = pd.MultiIndex.from_tuples(result)
but this causes problems with the item_id
and url
columns, taking just the first two characters of their names:
Q1 Q2 ID i u
SubQ1 SubQ2 SubQ1 t r
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
Converting the columns in df2
to be one-element tuples (ie. ('item_id',)
rather than just 'item_id'
) makes no difference.
How can I merge these two DataFrames and keep the MultiIndex properly? Or alternatively, how can I take the result of the merge and get back to columns with a proper MultiIndex without mucking up the names of the item_id
and url
columns?
If you can't beat 'em, join 'em. (Make both DataFrames have the same number of index levels before merging):
import pandas as pd
df1 = pd.DataFrame({('Q1', 'SubQ1'):[1, 2, 3], ('Q1', 'SubQ2'):[1, 2, 3], ('Q2', 'SubQ1'):[1, 2, 3]})
df1['ID'] = ['a', 'b', 'c']
df2 = pd.DataFrame({'item_id': ['a', 'b', 'c'], 'url':['a.com', 'blah.com', 'company.com']})
df2.columns = pd.MultiIndex.from_product([df2.columns, ['']])
result = pd.merge(df1, df2, left_on='ID', right_on='item_id')
print(result)
yields
Q1 Q2 ID item_id url
SubQ1 SubQ2 SubQ1
0 1 1 1 a a a.com
1 2 2 2 b b blah.com
2 3 3 3 c c company.com
This also avoids the UserWarning
:
pandas/core/reshape/merge.py:551: UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
The column for ID
is not "non-hierarchical". It is signified by ('ID', )
. However, pandas
allows you to reference just the first level of columns in a way that looks like you are referencing a single leveled column structure. Meaning this should work df1['ID']
as well as as df1[('ID',)]
as well as df1.loc[:, ('ID',)]
. But if it happened to be that the top level 'ID'
had more columns associated with it in the second level, df1['ID']
would return a dataframe. I feel more comfortable with this solution, which looks a lot like @JohnGalt's answer in the comments.
df1.assign(u=df1[('ID', '')].map(df2.set_index('item_id').url))
Q1 Q2 ID u
SubQ1 SubQ2 SubQ1
0 1 1 1 a a.com
1 2 2 2 b blah.com
2 3 3 3 c company.com
Joining on a single level column'd dataframe to a multi-level column'd dataframe is difficult. I have to artificially add another level.
def rnm(d):
d = d.copy()
d.columns = [d.columns, [''] * len(d.columns)]
return d
df1.join(rnm(df2.set_index('item_id')), on=('ID',))
Q1 Q2 ID url
SubQ1 SubQ2 SubQ1
0 1 1 1 a a.com
1 2 2 2 b blah.com
2 3 3 3 c company.com
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With