I have seen this question asked here before and have taken lessons from it. However, I am not sure why I am getting an error when I feel it should work.
I want to create a new column in an existing Spark DataFrame by some rules. Here is what I wrote. iris_spark is the data frame, with a categorical variable iris_class that has three distinct categories.
from pyspark.sql import functions as F

iris_spark_df = iris_spark.withColumn(
    "Class",
    F.when(iris_spark.iris_class == 'Iris-setosa', 0,
           F.when(iris_spark.iris_class == 'Iris-versicolor', 1)).otherwise(2))
Throws the following error.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))

TypeError: when() takes exactly 2 arguments (3 given)
Any idea why?
You can use if(exp1, exp2, exp3) inside spark.sql(), where exp1 is the condition: if it is true the result is exp2, otherwise it is exp3.
Like the SQL "case when" statement and the "switch" or "if then else" statements of popular programming languages, Spark SQL DataFrames support similar syntax using "when otherwise", or we can use a "case when" expression.
The triple equals operator === is not built into Scala; libraries commonly define it as a type-safe equality operator, similar in spirit to the one in JavaScript. Spark defines === as a method on Column that creates a new Column object comparing the Column on the left with the object on the right. The result is a boolean Column expression, not a plain Boolean.
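PySpark expresses the same idea by overloading == on its Column type. The following is a toy sketch of that mechanism in plain Python. ToyColumn is a hypothetical class invented for illustration; it is not Spark's real Column and only shows why a comparison can yield an expression object instead of True or False.

```python
class ToyColumn:
    """Hypothetical stand-in for a Spark Column; not Spark's real class."""

    def __init__(self, expr):
        self.expr = expr  # textual description of the expression

    def __eq__(self, other):
        # Build and return a new expression object instead of a bool.
        return ToyColumn(f"({self.expr} = {other!r})")

    def __repr__(self):
        return f"ToyColumn<{self.expr}>"


comparison = ToyColumn("iris_class") == "Iris-setosa"
print(comparison)
```

Because the comparison yields an expression rather than a truth value, it can be passed on as the condition argument of when and evaluated later, once per row.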
The correct structure is either:
from pyspark.sql.functions import col, when

(when(col("iris_class") == 'Iris-setosa', 0)
    .when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2))
which is equivalent to
CASE
    WHEN (iris_class = 'Iris-setosa') THEN 0
    WHEN (iris_class = 'Iris-versicolor') THEN 1
    ELSE 2
END
or:
(when(col("iris_class") == 'Iris-setosa', 0)
    .otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
        .otherwise(2)))
which is equivalent to:
CASE
    WHEN (iris_class = 'Iris-setosa') THEN 0
    ELSE CASE
        WHEN (iris_class = 'Iris-versicolor') THEN 1
        ELSE 2
    END
END
with general syntax:
when(condition, value).when(...)
or
when(condition, value).otherwise(...)
You probably mixed this up with the Hive IF conditional:
IF(condition, if-true, if-false)
which can be used only in raw SQL with Hive support.
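The claim that the chained and nested CASE forms above are equivalent can be checked without a Spark installation. This is a minimal sketch that uses Python's built-in sqlite3 module purely as a stand-in, on the assumption that CASE behaves the same way there because it is standard SQL. The table name iris and the third category 'Iris-virginica' are assumptions for illustration.

```python
import sqlite3

# SQLite is only a stand-in for Spark SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (iris_class TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?)",
    # 'Iris-virginica' is an assumed third category.
    [("Iris-setosa",), ("Iris-versicolor",), ("Iris-virginica",)],
)

# Chained form: one CASE with two WHEN branches.
chained = """
SELECT CASE WHEN iris_class = 'Iris-setosa' THEN 0
            WHEN iris_class = 'Iris-versicolor' THEN 1
            ELSE 2 END
FROM iris ORDER BY rowid
"""

# Nested form: the ELSE branch holds a second CASE.
nested = """
SELECT CASE WHEN iris_class = 'Iris-setosa' THEN 0
            ELSE CASE WHEN iris_class = 'Iris-versicolor' THEN 1
                      ELSE 2 END
       END
FROM iris ORDER BY rowid
"""

chained_result = [row[0] for row in conn.execute(chained)]
nested_result = [row[0] for row in conn.execute(nested)]
print(chained_result, nested_result)
```

Both queries map the three categories to the same codes, which is why either when/otherwise structure works for the Class column.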
Conditional statement In Spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// The implicits can only be imported once the session exists
import spark.implicits._

val data = List(
  ("James ", "", "Smith", "36636", "M", 60000),
  ("Michael ", "Rose", "", "40288", "M", 70000),
  ("Robert ", "", "Williams", "42114", "", 400000),
  ("Maria ", "Anne", "Jones", "39192", "F", 500000),
  ("Jen", "Mary", "Brown", "", "F", 0))
val cols = Seq("first_name", "middle_name", "last_name", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(cols: _*)
1. Using “when otherwise” on DataFrame
Map the value of gender to a new value, stored in a new_gender column:
// With withColumn
val df1 = df.withColumn("new_gender",
  when(col("gender") === "M", "Male")
    .when(col("gender") === "F", "Female")
    .otherwise("Unknown"))

// The same result with select
val df2 = df.select(col("*"),
  when(col("gender") === "M", "Male")
    .when(col("gender") === "F", "Female")
    .otherwise("Unknown").alias("new_gender"))
2. Using “case when” on DataFrame
val df3 = df.withColumn("new_gender",
  expr("case when gender = 'M' then 'Male' " +
    "when gender = 'F' then 'Female' " +
    "else 'Unknown' end"))
Alternatively,
val df4 = df.select(col("*"),
  expr("case when gender = 'M' then 'Male' " +
    "when gender = 'F' then 'Female' " +
    "else 'Unknown' end").alias("new_gender"))
3. Using && and || operators
val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")

dataDF.withColumn("new_column",
    when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")
      .otherwise("A1"))
  .show()
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+
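The same rule can also be written in the "case when" style from section 2, with OR and AND taking the place of || and &&. The sketch below checks that expression against the four rows above, using Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption, chosen only so the example runs without Spark). The table name data_df is hypothetical.

```python
import sqlite3

# SQLite stands in for Spark SQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_df (id INTEGER, code TEXT, amt TEXT)")
conn.executemany(
    "INSERT INTO data_df VALUES (?, ?, ?)",
    [(66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")],
)

# First matching WHEN branch wins; ELSE covers everything else.
case_expr = (
    "CASE WHEN code = 'a' OR code = 'd' THEN 'A' "
    "WHEN code = 'b' AND amt = '4' THEN 'B' "
    "ELSE 'A1' END"
)

rows = conn.execute(
    f"SELECT id, code, amt, {case_expr} AS new_column "
    "FROM data_df ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

A string like case_expr should also be usable inside expr(...) as in section 2, though that has not been run against Spark here.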