 

Apache Spark: add a "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame


I'm trying to add a "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame, using the Scala API. Starting dataframe:

    color
    Red
    Green
    Blue

Desired dataframe (SQL syntax: CASE WHEN color = 'Green' THEN 1 ELSE 0 END AS bool):

    color bool
    Red   0
    Green 1
    Blue  0

How should I implement this logic?

Leonardo Biagioli asked Jun 11 '15 14:06





2 Answers

In the upcoming Spark 1.4.0 release (expected within the next couple of days), you can use the when/otherwise syntax:

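    import org.apache.spark.sql.functions.when
    // Enables toDF and the $"..." column syntax
    // (assumes a SQLContext named sqlContext, as in spark-shell)
    import sqlContext.implicits._

    // Create the dataframe
    val df = Seq("Red", "Green", "Blue").map(Tuple1.apply).toDF("color")

    // Use when/otherwise syntax
    val df1 = df.withColumn("Green_Ind", when($"color" === "Green", 1).otherwise(0))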
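As a rough sketch of how the when/otherwise form extends to several branches: each chained `.when(...)` plays the role of another `WHEN` clause, and rows that match no branch come out as null if `.otherwise(...)` is left off. The `SparkSession` entry point (Spark 2.0 or later), the object name `CaseWhenSketch`, the helper `withColorCode`, the `color_code` column, and the extra Blue branch are all assumptions made for illustration; none of them come from the answer itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object CaseWhenSketch {
  // Hypothetical helper, equivalent to:
  //   CASE WHEN color = 'Green' THEN 1
  //        WHEN color = 'Blue'  THEN 2
  //        ELSE 0 END AS color_code
  def withColorCode(df: DataFrame): DataFrame =
    df.withColumn(
      "color_code",
      when(col("color") === "Green", 1)
        .when(col("color") === "Blue", 2)
        .otherwise(0) // without otherwise, unmatched rows would be null
    )

  def main(args: Array[String]): Unit = {
    // Assumes Spark 2.0+, where SparkSession is the entry point
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("case-when-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("Red", "Green", "Blue").toDF("color")
    withColorCode(df).show()

    spark.stop()
  }
}
```

Branches are evaluated in order, so the first matching condition wins, just as in a SQL CASE expression.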

If you are using Spark 1.3.0, you can choose to use a UDF:

    import org.apache.spark.sql.functions.udf

    // Define the UDF
    val isGreen = udf((color: String) => {
      if (color == "Green") 1
      else 0
    })
    val df2 = df.withColumn("Green_Ind", isGreen($"color"))
Herman answered Sep 23 '22 01:09



In Spark 1.5.0, you can also use SQL syntax through the expr function:

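    import org.apache.spark.sql.functions.expr

    // String comparison is case-sensitive, so the literal must be 'Green' to match the data
    val df3 = df.withColumn("Green_Ind", expr("case when color = 'Green' then 1 else 0 end"))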

or plain Spark SQL:

    df.registerTempTable("data")
    // Assumes a SQLContext named sqlContext, as in spark-shell
    val df4 = sqlContext.sql("""
      select *, case when color = 'Green' then 1 else 0 end as Green_Ind
      from data
    """)
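The temp-table approach above can be sketched for Spark 2.0 or later, where `registerTempTable` is deprecated in favour of `createOrReplaceTempView` and queries go through `SparkSession.sql`. The object name `SqlCaseWhenSketch`, the helper `withGreenInd`, and the use of `SparkSession` are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlCaseWhenSketch {
  // Hypothetical helper: registers df as the view "data" and adds
  // Green_Ind with a plain SQL CASE expression
  def withGreenInd(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("data")
    spark.sql(
      """SELECT *,
        |       CASE WHEN color = 'Green' THEN 1 ELSE 0 END AS Green_Ind
        |FROM data""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Assumes Spark 2.0+, where SparkSession is the entry point
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-case-when-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("Red", "Green", "Blue").toDF("color")
    withGreenInd(spark, df).show()

    spark.stop()
  }
}
```

The view name is scoped to the session, so calling the helper again with a different DataFrame simply replaces the `data` view.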
Robert Chevallier answered Sep 22 '22 01:09
