I am new to using DataFrame and I would like to know how to perform a SQL equivalent of left outer join on multiple columns on a series of tables
Example:
df1: Year Week Colour Val1 2014 A Red 50 2014 B Red 60 2014 B Black 70 2014 C Red 10 2014 D Green 20 df2: Year Week Colour Val2 2014 A Black 30 2014 B Black 100 2014 C Green 50 2014 C Red 20 2014 D Red 40 df3: Year Week Colour Val3 2013 B Red 60 2013 C Black 80 2013 B Black 10 2013 D Green 20 2013 D Red 50
Essentially I want to do something like this SQL code (Notice that df3 is not joined on Year):
SELECT df1.*, df2.Val2, df3.Val3 FROM df1 LEFT OUTER JOIN df2 ON df1.Year = df2.Year AND df1.Week = df2.Week AND df1.Colour = df2.Colour LEFT OUTER JOIN df3 ON df1.Week = df3.Week AND df1.Colour = df3.Colour
The result should look like:
Year Week Colour Val1 Val2 Val3 2014 A Red 50 Null Null 2014 B Red 60 Null 60 2014 B Black 70 100 Null 2014 C Red 10 20 Null 2014 D Green 20 Null Null
I have tried using merge and join but can't figure out how to do it on multiple tables and when there are multiple joints involved. Could someone help me on this please?
Thanks
We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes. Merging multiple Dataframes is similar to SQL join and supports different types of join inner , left , right , outer , cross .
A left join, or left merge, keeps every row from the left dataframe. Result from left-join or left-merge of two dataframes in Pandas. Rows in the left dataframe that have no corresponding join value in the right dataframe are left with NaN values.
Merge them in two steps, df1
and df2
first, and then the result of that to df3
.
In [33]: s1 = pd.merge(df1, df2, how='left', on=['Year', 'Week', 'Colour'])
I dropped year from df3 since you don't need it for the last join.
In [39]: df = pd.merge(s1, df3[['Week', 'Colour', 'Val3']], how='left', on=['Week', 'Colour']) In [40]: df Out[40]: Year Week Colour Val1 Val2 Val3 0 2014 A Red 50 NaN NaN 1 2014 B Red 60 NaN 60 2 2014 B Black 70 100 10 3 2014 C Red 10 20 NaN 4 2014 D Green 20 NaN 20 [5 rows x 6 columns]
One can also do this with a compact version of @TomAugspurger's answer, like so:
df = df1.merge(df2, how='left', on=['Year', 'Week', 'Colour']).merge(df3[['Week', 'Colour', 'Val3']], how='left', on=['Week', 'Colour'])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With