Spark Equivalent of IF Then ELSE

I have seen this question asked here before and have taken lessons from it. However, I am not sure why I am getting an error when I feel it should work.

I want to create a new column in an existing Spark DataFrame by some rules. Here is what I wrote. iris_spark is the data frame, with a categorical variable iris_class that has three distinct categories.

from pyspark.sql import functions as F

iris_spark_df = iris_spark.withColumn(
    "Class",
    F.when(iris_spark.iris_class == 'Iris-setosa', 0, F.when(iris_spark.iris_class == 'Iris-versicolor',1)).otherwise(2))

This throws the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-157-21818c7dc060> in <module>()
----> 1 iris_spark_df=iris_spark.withColumn("Class",F.when(iris_spark.iris_class=='Iris-setosa',0,F.when(iris_spark.iris_class=='Iris-versicolor',1)))

TypeError: when() takes exactly 2 arguments (3 given)

Any idea why?

712

asked Aug 19 '16 21:08

Baktaawar

2 Answers

The correct structure is either:

from pyspark.sql.functions import col, when

(when(col("iris_class") == 'Iris-setosa', 0)
    .when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2))

which is equivalent to

CASE
    WHEN (iris_class = 'Iris-setosa') THEN 0
    WHEN (iris_class = 'Iris-versicolor') THEN 1
    ELSE 2
END

or:

(when(col("iris_class") == 'Iris-setosa', 0)
    .otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
        .otherwise(2)))

which is equivalent to:

CASE WHEN (iris_class = 'Iris-setosa') THEN 0
     ELSE CASE WHEN (iris_class = 'Iris-versicolor') THEN 1
               ELSE 2
          END
END

The general syntax is:

when(condition, value).when(...)

or

when(condition, value).otherwise(...)

You probably confused this with the Hive IF conditional:

IF(condition, if-true, if-false)

which can be used only in raw SQL with Hive support.

102

answered Oct 02 '22 14:10

zero323

Conditional statement In Spark

Using “when otherwise” on DataFrame
Using “case when” on DataFrame
Using && and || operator

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{when, _}

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

// The implicits can only be imported after spark has been created
import spark.sqlContext.implicits._

val data = List(("James ","","Smith","36636","M",60000),
        ("Michael ","Rose","","40288","M",70000),
        ("Robert ","","Williams","42114","",400000),
        ("Maria ","Anne","Jones","39192","F",500000),
        ("Jen","Mary","Brown","","F",0))

val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*)

1. Using “when otherwise” on DataFrame

Add a new_gender column derived from the value of gender, using either withColumn or select:

val df1 = df.withColumn("new_gender", when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown"))

val df2 = df.select(col("*"), when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown").alias("new_gender"))

2. Using “case when” on DataFrame

val df3 = df.withColumn("new_gender",
  expr("case when gender = 'M' then 'Male' " +
                   "when gender = 'F' then 'Female' " +
                   "else 'Unknown' end"))

Alternatively,

val df4 = df.select(col("*"),
      expr("case when gender = 'M' then 'Male' " +
                       "when gender = 'F' then 'Female' " +
                       "else 'Unknown' end").alias("new_gender"))

3. Using && and || operator

val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")

dataDF.withColumn("new_column",
    when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")
      .otherwise("A1"))
  .show()

Output:

+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+

answered Oct 02 '22 14:10

vj sreenivasan
