
How to do conditional "withColumn" in a Spark dataframe?

I have a dataframe (mydf) as follows:

+---+---+---+---+
| F1| F2| F3| F4|
+---+---+---+---+
|  t| y4|  5|1.0|
|  x|  y|  1|0.5|
|  x|  y|  1|0.5|
|  x|  z|  2|1.0|
|  x|  b|  5|1.0|
|  t| y2|  6|1.0|
|  t| y3|  3|1.0|
|  x|  a|  4|1.0|
+---+---+---+---+

I want to do a conditional aggregation inside "withColumn" as follows:

mydf.withColumn("myVar", if($"F3" > 3) sum($"F4") else 0.0)

that is, for every row having $"F3" <= 3, myVar should have a value of 0.0, and for the others the sum of $"F4".

How to achieve it in Spark Scala?
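
For reference, a minimal sketch of how this example dataframe could be built (assuming an active SparkSession named spark):

import spark.implicits._ // for toDF and the $"col" syntax

val mydf = Seq(
  ("t", "y4", 5, 1.0),
  ("x", "y", 1, 0.5),
  ("x", "y", 1, 0.5),
  ("x", "z", 2, 1.0),
  ("x", "b", 5, 1.0),
  ("t", "y2", 6, 1.0),
  ("t", "y3", 3, 1.0),
  ("x", "a", 4, 1.0)
).toDF("F1", "F2", "F3", "F4")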

asked Nov 07 '18 by user3243499


1 Answer

You can use the when function for conditionals:

import org.apache.spark.sql.functions.when
import spark.implicits._ // for the $"col" syntax, assuming a SparkSession named spark

mydf.withColumn("myVar", when($"F3" > 3, $"F4").otherwise(0.0))
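
For the sample dataframe above, that should give something like:

+---+---+---+---+-----+
| F1| F2| F3| F4|myVar|
+---+---+---+---+-----+
|  t| y4|  5|1.0|  1.0|
|  x|  y|  1|0.5|  0.0|
|  x|  y|  1|0.5|  0.0|
|  x|  z|  2|1.0|  0.0|
|  x|  b|  5|1.0|  1.0|
|  t| y2|  6|1.0|  1.0|
|  t| y3|  3|1.0|  0.0|
|  x|  a|  4|1.0|  1.0|
+---+---+---+---+-----+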

But I don't understand what you want to sum, since there is a single value of F4 per row.

EDIT If you want to aggregate first, you can perform a groupBy and an agg as follows:

mydf.groupBy("F1", "F2")
  .agg(sum("F3").as("F3"), sum("F4").as("F4"))
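
On the sample data, the aggregation alone should return something like (row order may vary):

+---+---+---+---+
| F1| F2| F3| F4|
+---+---+---+---+
|  t| y4|  5|1.0|
|  x|  y|  2|1.0|
|  x|  z|  2|1.0|
|  x|  b|  5|1.0|
|  t| y2|  6|1.0|
|  t| y3|  3|1.0|
|  x|  a|  4|1.0|
+---+---+---+---+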

And then add the withColumn expression just as before.

Putting it all together:

import org.apache.spark.sql.functions.{sum, when}

mydf.groupBy("F1", "F2")
  .agg(sum("F3").as("F3"), sum("F4").as("F4"))
  .withColumn("myVar", when($"F3" > 3, $"F4").otherwise(0.0))
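
With the sample data above, the final result should look something like (row order may vary):

+---+---+---+---+-----+
| F1| F2| F3| F4|myVar|
+---+---+---+---+-----+
|  t| y4|  5|1.0|  1.0|
|  x|  y|  2|1.0|  0.0|
|  x|  z|  2|1.0|  0.0|
|  x|  b|  5|1.0|  1.0|
|  t| y2|  6|1.0|  1.0|
|  t| y3|  3|1.0|  0.0|
|  x|  a|  4|1.0|  1.0|
+---+---+---+---+-----+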

answered Sep 18 '22 by SCouto