Pandas merge() function is used to merge multiple Dataframes. We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes.
Key Points Pandas' merge and concat can be used to combine subsets of a DataFrame, or even data from different files. join function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on what data must be in the final DataFrame.
Joining DataFrames Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”.
Use merge
, which is an inner join by default:
pd.merge(df1, df2, left_index=True, right_index=True)
Or join
, which is a left join by default:
df1.join(df2)
Or concat
), which is an outer join by default:
pd.concat([df1, df2], axis=1)
Samples:
df1 = pd.DataFrame({'a':range(6),
'b':[5,3,6,9,2,4]}, index=list('abcdef'))
print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4
df2 = pd.DataFrame({'c':range(4),
'd':[10,20,30, 40]}, index=list('abhi'))
print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40
# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20
# Default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN
# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0
You can use concat([df1, df2, ...], axis=1) in order to concatenate two or more DFs aligned by indexes:
pd.concat([df1, df2, df3, ...], axis=1)
Or merge for concatenating by custom fields / indexes:
# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])
# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1' right_index=True)
or join for joining by index:
df1.join(df2)
By default:join
is a column-wise left joinpd.merge
is a column-wise inner joinpd.concat
is a row-wise outer join
pd.concat
:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2]
)
Dimensions of DataFrame should match along axis
Join
and pd.merge
:
can take DataFrame arguments
A silly bug that got me: the joins failed because index dtypes
differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index
, the indices looked identical in Jupyter. It only came to light when saving to Excel...
I fixed it with: df1[['key']] = df1[['key']].apply(pd.to_numeric)
Hopefully this saves somebody an hour!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With