Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas left join DataFrames by two columns

Could you please help me with joining two DataFrames.

I have two DataFrames.

df1:

index   val1    val2
--------------------
1       str1    abc1
2       str2    abc2
3       str3    abc3
4       str4    abc9
5       str5    abc4

df2:

index   val2
------------
1       abc1
2       abc2
4       abc3 
5       abc4
9       abc5

I need to create a DataFrame based on the previous two with a left join by two columns. Columns index and val2 have the same names in both DataFrames. The result of df3 should look like this:

index   val1    val2    val3
----------------------------
1       str1    abc1    1
2       str2    abc2    1
3       str3    abc3    NaN
4       str4    abc9    NaN
5       str5    abc4    1

Indexes from df2 that are not present in df1 should be dropped, if an index in df1 has the same val2 as in df2 then 1 should be added to a new column val3, else: NaN.

Huge thanks in advance!

like image 886
Andrei Avatar asked Jun 16 '18 15:06

Andrei


3 Answers

This is one way. As below, I recommend you use Boolean rather than float for val3, as this what the series represents.

# merge and set index
res = df1.merge(df2, how='left').set_index('index')

# map val2 from df2
res['val3'] = df2.set_index('index')['val2']

# check for equality of val3 and val2
res['val3'] = res['val3'] == res['val2']

print(res)

       val1  val2   val3
index                   
1      str1  abc1   True
2      str2  abc2   True
3      str3  abc3  False
4      str4  abc9  False
5      str5  abc4   True
like image 169
jpp Avatar answered Nov 07 '22 21:11

jpp


You can try using join (default is left) with rsuffix so that column name for df2 is renamed with suffix. Then, using np.where to check if column value matches and assigning value for val3 column.

import numpy as np

df = df1.join(df2, rsuffix='_df2')
df['val3'] = np.where(df.val2 == df.val2_df2, 1, np.NaN)
del df['val2_df2']
print(df)

Result:

       val1  val2  val3
index                  
1      str1  abc1   1.0
2      str2  abc2   1.0
3      str3  abc3   NaN
4      str4  abc9   NaN
5      str5  abc4   1.0
like image 24
student Avatar answered Nov 07 '22 22:11

student


Since you want to merge on a combination of indices and columns you can either add them all to the index, or reset_index before the merge. We'll also assign the val3 column to df2 so it gets merged over.

(df1.reset_index().merge(
     df2.reset_index().assign(val3 = 1), on=['index', 'val2'], how='left')
    .set_index('index'))

Outputs:

       val1  val2  val3
index                  
1      str1  abc1   1.0
2      str2  abc2   1.0
3      str3  abc3   NaN
4      str4  abc9   NaN
5      str5  abc4   1.0

If 'index' is just a column and not the index, it's as simple as specifying two keys to merge on.

df1.merge(df2.assign(val3 = 1), on=['index', 'val2'], how='left')

Outputs:

   index  val1  val2  val3
0      1  str1  abc1   1.0
1      2  str2  abc2   1.0
2      3  str3  abc3   NaN
3      4  str4  abc9   NaN
4      5  str5  abc4   1.0
like image 23
ALollz Avatar answered Nov 07 '22 21:11

ALollz