
Apache Spark: dealing with CASE statements


I am transforming SQL code to PySpark code and came across some SQL statements. I don't know how to approach CASE statements in PySpark. I am planning on creating an RDD, then using rdd.map and doing some logic checks. Is that the right approach? Please help!

Basically, I need to go through each row in the RDD or DataFrame and, based on some logic, edit one of the column values.

case
    when (e."a" Like 'a%' Or e."b" Like 'b%')
         And e."aa"='BW' And cast(e."abc" as decimal(10,4))=75.0 Then 'callitA'
    when (e."a" Like 'b%' Or e."b" Like 'a%')
         And e."aa"='AW' And cast(e."abc" as decimal(10,4))=75.0 Then 'callitB'
    else 'CallitC'
end
Amar Singh asked Oct 11 '16


People also ask

Can we use a CASE statement in Spark SQL?

Like the SQL "case when" statement and the "switch" / "if then else" statements from popular programming languages, the Spark SQL DataFrame API supports similar syntax using "when ... otherwise", or you can use a SQL "case when" expression directly.

How do you use a case in PySpark?

PySpark's CASE WHEN is essentially the SQL expression: CASE WHEN cond1 THEN result WHEN cond2 THEN result ... ELSE result END. In the DataFrame API, the equivalent is the when() and otherwise() functions, imported from pyspark.sql.functions.

Are Spark columns case sensitive?

Spark can be case sensitive, but it is case insensitive by default. In order to avoid potential data corruption or data loss, duplicate column names are not allowed.

Is PySpark case sensitive?

When spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values.
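
For reference, this behavior is controlled by a runtime configuration flag; a minimal sketch, assuming an active SparkSession named spark:

# spark.sql.caseSensitive is false by default; setting it to true makes
# column name resolution case sensitive.
spark.conf.set("spark.sql.caseSensitive", "true")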


1 Answer

These are a few ways to write an If-Else / When-Then-Else / When-Otherwise expression in PySpark.

Sample dataframe

df = spark.createDataFrame([(1,1),(2,2),(3,3)],['id','value'])
df.show()

#+---+-----+
#| id|value|
#+---+-----+
#|  1|    1|
#|  2|    2|
#|  3|    3|
#+---+-----+

#Desired Output:
#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+

Option#1: withColumn() using when-otherwise

from pyspark.sql.functions import when

df.withColumn("value_desc",
              when(df.value == 1, 'one')
              .when(df.value == 2, 'two')
              .otherwise('other')).show()

Option#2: select() using when-otherwise

from pyspark.sql.functions import when

df.select("*",
          when(df.value == 1, 'one')
          .when(df.value == 2, 'two')
          .otherwise('other').alias('value_desc')).show()

Option#3: selectExpr() using the SQL-equivalent CASE expression

df.selectExpr("*",
              "CASE WHEN value == 1 THEN 'one' WHEN value == 2 THEN 'two' ELSE 'other' END AS value_desc").show()

A SQL-like expression can also be used with withColumn() and select() via the pyspark.sql.functions.expr function. Here are examples.

Option#4: select() using the expr function

from pyspark.sql.functions import expr

df.select("*",
          expr("CASE WHEN value == 1 THEN 'one' WHEN value == 2 THEN 'two' ELSE 'other' END AS value_desc")).show()

Option#5: withColumn() using the expr function

from pyspark.sql.functions import expr

# withColumn() already names the column, so no AS alias is needed in the expression
df.withColumn("value_desc",
              expr("CASE WHEN value == 1 THEN 'one' WHEN value == 2 THEN 'two' ELSE 'other' END")).show()

Output:

#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+
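
Applying Option#1 to the CASE expression from the question, here is a minimal sketch. It assumes e is a DataFrame with columns a, b, aa, and abc as in the question; the output column name label is illustrative:

from pyspark.sql.functions import when, col

e.withColumn(
    "label",  # hypothetical name for the derived column
    # first branch: (a like 'a%' or b like 'b%') and aa = 'BW' and abc = 75.0
    when((col("a").like("a%") | col("b").like("b%"))
         & (col("aa") == "BW")
         & (col("abc").cast("decimal(10,4)") == 75.0), "callitA")
    # second branch: (a like 'b%' or b like 'a%') and aa = 'AW' and abc = 75.0
    .when((col("a").like("b%") | col("b").like("a%"))
          & (col("aa") == "AW")
          & (col("abc").cast("decimal(10,4)") == 75.0), "callitB")
    .otherwise("CallitC")).show()

This keeps everything in the DataFrame API, so the rdd.map approach mentioned in the question is not needed.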
Shantanu Sharma answered Oct 04 '22