I have a dataframe (mydf) as follows:
+---+---+---+---+
| F1| F2| F3| F4|
+---+---+---+---+
| t| y4| 5|1.0|
| x| y| 1|0.5|
| x| y| 1|0.5|
| x| z| 2|1.0|
| x| b| 5|1.0|
| t| y2| 6|1.0|
| t| y3| 3|1.0|
| x| a| 4|1.0|
+---+---+---+---+
I want to do a conditional aggregation inside withColumn as follows:
mydf.withColumn("myVar", if($"F3" > 3) sum($"F4") else 0.0)
That is, for every row having $"F3" <= 3, myVar should have a value of 0.0, and for the others the sum of $"F4".
How can I achieve this in Spark Scala?
You can use the when function to express conditionals:
import org.apache.spark.sql.functions.when
mydf.withColumn("myVar", when($"F3" > 3, $"F4").otherwise(0.0))
But I don't understand what you want to sum, since there is a single value of F4 per row.
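If the intention was to add the total of F4 over some grouping while keeping every row, a window function would do that. A minimal sketch, assuming (hypothetically) that F1 is the grouping key, since the question doesn't say:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

// Hypothetical grouping key F1; adjust the partition columns to your actual key.
val w = Window.partitionBy("F1")

mydf.withColumn("myVar", when($"F3" > 3, sum($"F4").over(w)).otherwise(0.0))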
EDIT
If you want to aggregate first, you can perform a groupBy and an agg as follows:
mydf.groupBy("F1", "F2")
.agg(sum("F3").as("F3"), sum("F4").as("F4"))
Then add the withColumn expression just as before.
Putting it all together:
mydf.groupBy("F1", "F2")
.agg(sum("F3").as("F3"), sum("F4").as("F4"))
.withColumn("myVar", when($"F3" > 3, $"F4").otherwise(0.0))