I have two dataframes, df1 and df2:
df1.show()
+---+--------+-----+----+--------+
| cA|      cB|   cC|  cD|      cE|
+---+--------+-----+----+--------+
|  A|     abc|  0.1| 0.0|       0|
|  B|     def| 0.15| 0.5|       0|
|  C|     ghi|  0.2| 0.2|       1|
|  D|     jkl|  1.1| 0.1|       0|
|  E|     mno|  0.1| 0.1|       0|
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
| cA|      cB|   cH|  cI|      cJ|
+---+--------+-----+----+--------+
|  A|     abc|    a|   b|       ?|
|  C|     ghi|    a|   c|       ?|
+---+--------+-----+----+--------+
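For reference, a minimal sketch that builds both example dataframes (it assumes an existing SparkSession named spark; column types are inferred from the sample data):
# Minimal sketch of the sample data; assumes a SparkSession named `spark` already exists
df1 = spark.createDataFrame(
    [("A", "abc", 0.1, 0.0, 0), ("B", "def", 0.15, 0.5, 0), ("C", "ghi", 0.2, 0.2, 1),
     ("D", "jkl", 1.1, 0.1, 0), ("E", "mno", 0.1, 0.1, 0)],
    ["cA", "cB", "cC", "cD", "cE"])
df2 = spark.createDataFrame(
    [("A", "abc", "a", "b", "?"), ("C", "ghi", "a", "c", "?")],
    ["cA", "cB", "cH", "cI", "cJ"])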
I would like to update the cE column in df1 and set it to 1 if the row is referenced in df2. Each record is identified by the cA and cB columns.
Below is the desired output; note that the cE value of the first record was updated to 1:
+---+--------+-----+----+--------+
| cA|      cB|   cC|  cD|      cE|
+---+--------+-----+----+--------+
|  A|     abc|  0.1| 0.0|       1|
|  B|     def| 0.15| 0.5|       0|
|  C|     ghi|  0.2| 0.2|       1|
|  D|     jkl|  1.1| 0.1|       0|
|  E|     mno|  0.1| 0.1|       0|
+---+--------+-----+----+--------+
When you need to update a column value based on a condition over other columns, the when clause comes in handy; see the when and otherwise functions in pyspark.sql.functions.
import pyspark.sql.functions as F

# Records are matched on the cA and cB key columns
cond = (df1.cA == df2.cA) & (df1.cB == df2.cB)
# Join, flag matched rows with 1 (unmatched rows get 0), and keep df1's columns
df3 = (df1.join(df2, cond, "full")
          .withColumn("cE", F.when(cond, 1).otherwise(0))
          .select(df1.cA, df1.cB, df1.cC, df1.cD, "cE"))
df3.show()
+---+---+----+---+---+
| cA| cB|  cC| cD| cE|
+---+---+----+---+---+
|  E|mno| 0.1|0.1|  0|
|  B|def|0.15|0.5|  0|
|  C|ghi| 0.2|0.2|  1|
|  A|abc| 0.1|0.0|  1|
|  D|jkl| 1.1|0.1|  0|
+---+---+----+---+---+
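Note that a full join would also pull in any df2 records that have no match in df1, which is not needed here. A left join keeps exactly df1's rows; below is a sketch of that variant (it reuses the same dataframes and the F import from above, and introduces a temporary helper column named matched for illustration):
# Left-join variant: keeps exactly df1's rows; a non-null marker means the
# record is referenced in df2 (the `matched` helper column is dropped afterwards)
df4 = (df1.join(df2.select("cA", "cB", F.lit(1).alias("matched")), ["cA", "cB"], "left")
          .withColumn("cE", F.when(F.col("matched").isNotNull(), 1).otherwise(0))
          .drop("matched"))
df4.show()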