
PySpark: multiple conditions in when clause

I would like to modify the values of a DataFrame column (Age) where it is currently blank, but only when another column (Survived) has the value 0 in the same row. If Survived is 1 and Age is blank, I want to keep Age as null.

I tried to use the && operator, but it didn't work. Here is my code:

tdata.withColumn("Age",  when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()

Any suggestions how to handle that? Thanks.

Error Message:

SyntaxError: invalid syntax
  File "<ipython-input-33-3e691784411c>", line 1
    tdata.withColumn("Age",  when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
sjishan asked Jun 08 '16 15:06



2 Answers

You get a SyntaxError because Python has no && operator. It has and and &, and the latter is the correct choice for building boolean expressions on Column (with | for logical disjunction and ~ for logical negation).
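To illustrate why & works where and cannot, here is a minimal pure-Python sketch. The Expr class is a hypothetical stand-in for a Spark Column, not PySpark's actual implementation: it only shows that &, | and ~ can be overloaded to build an expression, while and always asks the object for a plain bool.

```python
class Expr:
    """Hypothetical stand-in for a Spark Column: records an expression as text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Build a new expression instead of returning True or False.
        return Expr(f"({self.text} = {other})")

    def __and__(self, other):
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")

    def __bool__(self):
        # `and`, `or` and `not` need a plain bool, which an unevaluated
        # expression cannot provide.
        raise ValueError("cannot convert an expression to bool; use &, | or ~")


age, survived = Expr("Age"), Expr("Survived")

# & dispatches to __and__, so the two comparisons are combined into one expression.
combined = (age == "") & (survived == "0")
print(combined.text)

# `and` calls __bool__ on the left operand and fails.
try:
    (age == "") and (survived == "0")
    and_failed = False
except ValueError as exc:
    and_failed = True
    print(exc)
```

PySpark's Column behaves similarly in this respect: using and on two Column objects raises a ValueError telling you to use &, | or ~ instead.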

The condition you wrote is also invalid because it ignores operator precedence. In Python, & binds more tightly than ==, so each comparison has to be parenthesized:

(col("Age") == "") & (col("Survived") == "0")
## Column<b'((Age = ) AND (Survived = 0))'>
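The precedence point can be checked without Spark. The sketch below uses Python's standard ast module to show how the parser groups each version of the condition; the helper names outermost_node and is_valid_python are made up for this illustration.

```python
import ast


def outermost_node(expr: str) -> ast.AST:
    """Parse a Python expression and return its outermost AST node."""
    return ast.parse(expr, mode="eval").body


def is_valid_python(expr: str) -> bool:
    """Return True if the text parses as a Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


unparenthesized = outermost_node('age == "" & survived == "0"')
parenthesized = outermost_node('(age == "") & (survived == "0")')

# Without parentheses the outermost node is a chained comparison whose middle
# operand is the bitwise expression `"" & survived`.
print(type(unparenthesized).__name__)                 # Compare
print(type(unparenthesized.comparators[0]).__name__)  # BinOp

# With parentheses the outermost node is the & operation joining two comparisons.
print(type(parenthesized).__name__)                   # BinOp

# && is not Python syntax at all, which explains the SyntaxError in the question.
print(is_valid_python('age == "" && survived == "0"'))  # False
```

In other words, the unparenthesized form is read as age == ("" & survived) == "0", which is not the intended condition.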

As a side note, the when function is equivalent to a SQL CASE expression, not a WHEN clause. The same rules still apply. Conjunction:

df.where((col("foo") > 0) & (col("bar") < 0))

Disjunction:

df.where((col("foo") > 0) | (col("bar") < 0))

You can of course define conditions separately to avoid brackets:

cond1 = col("Age") == "" 
cond2 = col("Survived") == "0"

cond1 & cond2
zero323 answered Oct 19 '22 10:10


In PySpark, multiple conditions in when can be built using & (for and) and | (for or).

Note: In PySpark it is important to enclose in parentheses () every expression that is combined to form the condition.

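from pyspark.sql.functions import col, when

dataDF = spark.createDataFrame([(66, "a", "4"),
                                (67, "a", "0"),
                                (70, "b", "4"),
                                (71, "d", "4")],
                                ("id", "code", "amt"))

# Chain when() calls; every comparison is wrapped in its own parentheses
dataDF.withColumn("new_column",
       when((col("code") == "a") | (col("code") == "d"), "A")
      .when((col("code") == "b") & (col("amt") == "4"), "B")).show()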

In Spark Scala code, the && and || operators can be used within the when function:

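import org.apache.spark.sql.functions.{col, when}
import spark.implicits._  // needed for toDF on a Seq

val dataDF = Seq(
      (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
      ).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
       when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")).show()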


+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+

This code snippet is copied from sparkbyexamples.com

vj sreenivasan answered Oct 19 '22 10:10