Spark Equivalent of IF Then ELSE

I have seen this question asked here before and I took lessons from it. However, I am not sure why I am getting an error when I feel it should work.

I want to create a new column in an existing Spark DataFrame based on some rules. Here is what I wrote. iris_spark is the DataFrame with a categorical variable iris_class that has three distinct categories.

from pyspark.sql import functions as F

iris_spark_df = iris_spark.withColumn(
    "Class",
    F.when(iris_spark.iris_class == 'Iris-setosa', 0, F.when(iris_spark.iris_class == 'Iris-versicolor', 1)).otherwise(2))

Throws the following error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))

TypeError: when() takes exactly 2 arguments (3 given)

Any idea why?

asked Aug 19 '16 at 21:08 by Baktaawar

People also ask

How do you write if else in Spark?

You can use if(exp1, exp2, exp3) inside spark.sql(), where exp1 is the condition: if it is true the expression returns exp2, otherwise exp3.
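For example, a minimal sketch of that usage, assuming Spark 2.x and that iris_spark is an existing DataFrame with an iris_class column (the view name "iris" is made up for illustration):

# Hypothetical example of IF(...) in raw Spark SQL.
iris_spark.createOrReplaceTempView("iris")

labelled = spark.sql("""
    SELECT *,
           IF(iris_class = 'Iris-setosa', 0, 2) AS Class
    FROM iris
""")
labelled.show()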

Does Spark support if statement?

Like the SQL "case when" statement and the "switch" and "if then else" statements from popular programming languages, Spark SQL DataFrames also support similar syntax using "when otherwise", or alternatively a "case when" statement.

What is === in PySpark?

The triple equals operator === is normally the Scala type-safe equals operator, analogous to the one in JavaScript. Spark overrides it with a method on Column that creates a new Column object comparing the Column on the left with the object on the right, returning a boolean column.
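In PySpark the plain == operator plays the same role; a minimal sketch (the column name is just for illustration):

from pyspark.sql import functions as F

# Comparing a Column with a value returns a new Column expression,
# not a Python boolean.
cond = F.col("iris_class") == "Iris-setosa"
print(type(cond))  # <class 'pyspark.sql.column.Column'>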


2 Answers

The correct structure is either:

(when(col("iris_class") == 'Iris-setosa', 0)
    .when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2))

which is equivalent to

CASE
    WHEN (iris_class = 'Iris-setosa') THEN 0
    WHEN (iris_class = 'Iris-versicolor') THEN 1
    ELSE 2
END

or:

(when(col("iris_class") == 'Iris-setosa', 0)
    .otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
        .otherwise(2)))

which is equivalent to:

CASE WHEN (iris_class = 'Iris-setosa') THEN 0
     ELSE CASE WHEN (iris_class = 'Iris-versicolor') THEN 1
               ELSE 2
          END
END

with general syntax:

when(condition, value).when(...) 

or

when(condition, value).otherwise(...) 

You probably mixed things up with the Hive IF conditional:

IF(condition, if-true, if-false) 

which can be used only in raw SQL with Hive support.
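Applied to the code from the question, the corrected version would be a sketch along these lines (assuming iris_spark has the iris_class column described above):

from pyspark.sql import functions as F

# Each when() takes exactly two arguments: a condition and a value.
iris_spark_df = iris_spark.withColumn(
    "Class",
    F.when(iris_spark.iris_class == 'Iris-setosa', 0)
     .when(iris_spark.iris_class == 'Iris-versicolor', 1)
     .otherwise(2))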

answered Oct 02 '22 at 14:10 by zero323

Conditional statements in Spark

  • Using “when otherwise” on DataFrame
  • Using “case when” on DataFrame
  • Using && and || operator

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

import spark.sqlContext.implicits._

val data = List(("James ", "", "Smith", "36636", "M", 60000),
  ("Michael ", "Rose", "", "40288", "M", 70000),
  ("Robert ", "", "Williams", "42114", "", 400000),
  ("Maria ", "Anne", "Jones", "39192", "F", 500000),
  ("Jen", "Mary", "Brown", "", "F", 0))

val cols = Seq("first_name", "middle_name", "last_name", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(cols:_*)

1. Using “when otherwise” on DataFrame

Replace the value of gender with a new value:

val df1 = df.withColumn("new_gender", when(col("gender") === "M", "Male")
  .when(col("gender") === "F", "Female")
  .otherwise("Unknown"))

val df2 = df.select(col("*"), when(col("gender") === "M", "Male")
  .when(col("gender") === "F", "Female")
  .otherwise("Unknown").alias("new_gender"))

2. Using “case when” on DataFrame

val df3 = df.withColumn("new_gender",
  expr("case when gender = 'M' then 'Male' " +
       "when gender = 'F' then 'Female' " +
       "else 'Unknown' end"))

Alternatively,

val df4 = df.select(col("*"),
  expr("case when gender = 'M' then 'Male' " +
       "when gender = 'F' then 'Female' " +
       "else 'Unknown' end").alias("new_gender"))
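The same SQL-style expression also works in PySpark through F.expr; a minimal sketch, assuming a DataFrame df with a gender column:

from pyspark.sql import functions as F

# Hypothetical PySpark equivalent of the Scala expr("case when ...") call.
df5 = df.withColumn(
    "new_gender",
    F.expr("case when gender = 'M' then 'Male' "
           "when gender = 'F' then 'Female' "
           "else 'Unknown' end"))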

3. Using && and || operator

val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")

dataDF.withColumn("new_column",
  when(col("code") === "a" || col("code") === "d", "A")
  .when(col("code") === "b" && col("amt") === "4", "B")
  .otherwise("A1"))
  .show()

Output:

+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+
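In PySpark the same boolean logic uses | and & instead of || and &&, and each comparison needs its own parentheses because of Python operator precedence; a sketch under those assumptions (assuming an existing SparkSession named spark):

from pyspark.sql import functions as F

data_df = spark.createDataFrame(
    [(66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")],
    ["id", "code", "amt"])

# Parentheses around each comparison are required before combining with | and &.
data_df.withColumn(
    "new_column",
    F.when((F.col("code") == "a") | (F.col("code") == "d"), "A")
     .when((F.col("code") == "b") & (F.col("amt") == "4"), "B")
     .otherwise("A1")).show()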
answered Oct 02 '22 at 14:10 by vj sreenivasan