I am converting SQL code to PySpark and came across some SQL statements. I don't know how to approach CASE statements in PySpark. I was planning to create an RDD, then use rdd.map and do some logic checks. Is that the right approach? Please help!
Basically, I need to go through each row in the RDD or DataFrame and, based on some logic, edit one of the column values.
CASE WHEN (e."a" LIKE 'a%' OR e."b" LIKE 'b%') AND e."aa" = 'BW'
          AND CAST(e."abc" AS decimal(10,4)) = 75.0 THEN 'callitA'
     WHEN (e."a" LIKE 'b%' OR e."b" LIKE 'a%') AND e."aa" = 'AW'
          AND CAST(e."abc" AS decimal(10,4)) = 75.0 THEN 'callitB'
     ELSE 'CallitC'
END
Like the SQL CASE WHEN statement and the switch / if-then-else constructs from popular programming languages, the Spark SQL DataFrame API supports similar logic through when().otherwise(), or through the SQL CASE WHEN expression itself.
PySpark's CASE WHEN follows the standard SQL form: CASE WHEN cond1 THEN result1 WHEN cond2 THEN result2 ... ELSE default END. For the DataFrame equivalent, import the when() function (which chains with otherwise()) from pyspark.sql.functions.
Spark can be case sensitive, but it is case insensitive by default. To avoid potential data corruption or data loss, duplicate column names are not allowed. When spark.sql.caseSensitive is set to false, Spark performs case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even when column names differ in letter case, Spark returns the corresponding column values.
These are a few ways to write an if-else / when-then-else / when-otherwise expression in PySpark.
Sample dataframe:

df = spark.createDataFrame([(1,1),(2,2),(3,3)], ['id','value'])
df.show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    1|
#|  2|    2|
#|  3|    3|
#+---+-----+

#Desired Output:
#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+
Option 1: withColumn() using when-otherwise

from pyspark.sql.functions import when

df.withColumn("value_desc",
              when(df.value == 1, 'one')
              .when(df.value == 2, 'two')
              .otherwise('other')).show()
Option 2: select() using when-otherwise

from pyspark.sql.functions import when

df.select("*",
          when(df.value == 1, 'one')
          .when(df.value == 2, 'two')
          .otherwise('other').alias('value_desc')).show()
Option 3: selectExpr() using the SQL-equivalent CASE expression

df.selectExpr("*",
              "CASE WHEN value == 1 THEN 'one' WHEN value == 2 THEN 'two' ELSE 'other' END AS value_desc").show()
A SQL-like expression can also be used inside withColumn() and select() via the pyspark.sql.functions.expr function. Here are examples.
Option 4: select() using the expr function

from pyspark.sql.functions import expr

df.select("*",
          expr("CASE WHEN value == 1 THEN 'one' WHEN value == 2 THEN 'two' ELSE 'other' END AS value_desc")).show()
Option 5: withColumn() using the expr function (the alias inside the expression is unnecessary here, since withColumn() already names the column)

from pyspark.sql.functions import expr

df.withColumn("value_desc",
              expr("CASE WHEN value == 1 THEN 'one' WHEN value == 2 THEN 'two' ELSE 'other' END")).show()
Output:

#+---+-----+----------+
#| id|value|value_desc|
#+---+-----+----------+
#|  1|    1|       one|
#|  2|    2|       two|
#|  3|    3|     other|
#+---+-----+----------+