I have a dataframe (mydf) as follows:
+---+---+---+---+
| F1| F2| F3| F4|
+---+---+---+---+
| t| y4| 5|1.0|
| x| y| 1|0.5|
| x| y| 1|0.5|
| x| z| 2|1.0|
| x| b| 5|1.0|
| t| y2| 6|1.0|
| t| y3| 3|1.0|
| x| a| 4|1.0|
+---+---+---+---+
I want to do a conditional aggregation inside withColumn as follows:
mydf.withColumn("myVar", if($"F3" > 3) sum($"F4") else 0.0)
That is, for every row having $"F3" <= 3, myVar should have a value of 0.0, and for the others the sum of $"F4".
How can I achieve this in Spark Scala?
You can use the when function to express conditionals:
import org.apache.spark.sql.functions.when
mydf.withColumn("myVar", when($"F3" > 3, $"F4").otherwise(0.0))
But I don't understand what you want to sum, since there is a single value of F4 per row.
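If the intention was to add the total of F4 over some grouping while keeping every row, a window function would do that. A minimal sketch, assuming (hypothetically) that F1 is the grouping key, since the question doesn't say:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

// Hypothetical grouping key F1; adjust the partition columns to your actual key.
val w = Window.partitionBy("F1")

mydf.withColumn("myVar", when($"F3" > 3, sum($"F4").over(w)).otherwise(0.0))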
EDIT
If you want to aggregate first, you can perform a groupBy and an agg as follows:
mydf.groupBy("F1", "F2")
.agg(sum("F3").as("F3"), sum("F4").as("F4"))
Then add the withColumn expression just as before.
Putting it all together:
mydf.groupBy("F1", "F2")
.agg(sum("F3").as("F3"), sum("F4").as("F4"))
.withColumn("myVar", when($"F3" > 3, $"F4").otherwise(0.0))