Could you please help me with joining two DataFrames.
I have two DataFrames.
df1:
index val1 val2
--------------------
1 str1 abc1
2 str2 abc2
3 str3 abc3
4 str4 abc9
5 str5 abc4
df2:
index val2
------------
1 abc1
2 abc2
4 abc3
5 abc4
9 abc5
I need to create a DataFrame based on the previous two with a left join by two columns. Columns index and val2 have the same names in both DataFrames. The result of df3 should look like this:
index val1 val2 val3
----------------------------
1 str1 abc1 1
2 str2 abc2 1
3 str3 abc3 NaN
4 str4 abc9 NaN
5 str5 abc4 1
Indexes from df2 that are not present in df1 should be dropped, if an index in df1 has the same val2 as in df2 then 1 should be added to a new column val3, else: NaN.
Huge thanks in advance!
This is one way. As below, I recommend you use Boolean rather than float
for val3
, as this what the series represents.
# merge and set index
res = df1.merge(df2, how='left').set_index('index')
# map val2 from df2
res['val3'] = df2.set_index('index')['val2']
# check for equality of val3 and val2
res['val3'] = res['val3'] == res['val2']
print(res)
val1 val2 val3
index
1 str1 abc1 True
2 str2 abc2 True
3 str3 abc3 False
4 str4 abc9 False
5 str5 abc4 True
You can try using join
(default is left) with rsuffix
so that column name for df2 is renamed with suffix. Then, using np.where
to check if column value matches and assigning value for val3
column.
import numpy as np
df = df1.join(df2, rsuffix='_df2')
df['val3'] = np.where(df.val2 == df.val2_df2, 1, np.NaN)
del df['val2_df2']
print(df)
Result:
val1 val2 val3
index
1 str1 abc1 1.0
2 str2 abc2 1.0
3 str3 abc3 NaN
4 str4 abc9 NaN
5 str5 abc4 1.0
Since you want to merge on a combination of indices and columns you can either add them all to the index, or reset_index
before the merge. We'll also assign the val3
column to df2 so it gets merged over.
(df1.reset_index().merge(
df2.reset_index().assign(val3 = 1), on=['index', 'val2'], how='left')
.set_index('index'))
Outputs:
val1 val2 val3
index
1 str1 abc1 1.0
2 str2 abc2 1.0
3 str3 abc3 NaN
4 str4 abc9 NaN
5 str5 abc4 1.0
If 'index'
is just a column and not the index, it's as simple as specifying two keys to merge on.
df1.merge(df2.assign(val3 = 1), on=['index', 'val2'], how='left')
Outputs:
index val1 val2 val3
0 1 str1 abc1 1.0
1 2 str2 abc2 1.0
2 3 str3 abc3 NaN
3 4 str4 abc9 NaN
4 5 str5 abc4 1.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With