I would like to modify the cell values of a DataFrame column (Age) where it is currently blank, but only if another column (Survived) has the value 0 for the corresponding row. If Survived is 1 and Age is blank, I want to keep Age as null.
I tried to use the && operator, but it didn't work. Here is my code:
tdata.withColumn("Age", when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
Any suggestions on how to handle this? Thanks.
Error Message:
File "<ipython-input-33-3e691784411c>", line 1
    tdata.withColumn("Age", when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
                                                   ^
SyntaxError: invalid syntax
In PySpark, multiple conditions in when can be built using & (for and) and | (for or).
PySpark when() is a SQL function; to use it you first have to import it from pyspark.sql.functions, and it returns a Column type. otherwise() is a function of Column; when otherwise() is not used and none of the conditions are met, a None (null) value is assigned. The usage looks like when(condition, value).otherwise(default).
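As a minimal sketch of that null behavior (the DataFrame and column names here are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (-1,)], ["value"])

# No otherwise(): the row with value == -1 matches no condition,
# so its "label" column is null
df.withColumn("label", when(col("value") > 0, "positive")).show()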
To subset or filter the data from the DataFrame, we use the filter() function. It filters rows from the DataFrame on the basis of a given condition, which can be single or multiple: df.filter(condition), where df is the DataFrame from which the data is subset or filtered.
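For example, a small sketch of both forms (again with a made-up DataFrame):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (-3,)], ["value"])

# Single condition
df.filter(col("value") > 0).show()

# Multiple conditions: each comparison parenthesized, combined with & or |
df.filter((col("value") > 0) & (col("value") < 4)).show()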
If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
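A minimal sketch of that approach (the column name idx is arbitrary; coalesce(1) is used here only so the generated IDs happen to start at 0 in a single partition):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# One partition, so the IDs come out as 0, 1, 2
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"]).coalesce(1)
df = df.withColumn("idx", monotonically_increasing_id())

# Keep the first two rows by filtering on the index column
df.filter(col("idx") < 2).show()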
You get a SyntaxError exception because Python has no && operator. It has and and &, where the latter one is the correct choice to create boolean expressions on Column (| for logical disjunction and ~ for logical negation).
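For illustration, a sketch of the ~ negation operator, which is not demonstrated below (the two-row tdata here is a made-up stand-in for the question's DataFrame):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
tdata = spark.createDataFrame([("", "0"), ("22", "1")], ["Age", "Survived"])

# ~ negates a boolean Column expression: keeps rows where Age is NOT blank
tdata.where(~(col("Age") == "")).show()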
The condition you created is also invalid because it doesn't consider operator precedence. & in Python has a higher precedence than ==, so the expression has to be parenthesized.
(col("Age") == "") & (col("Survived") == "0")
## Column<b'((Age = ) AND (Survived = 0))'>
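Applied to the question, a sketch of the corrected statement (tdata and mean_age_0 are assumed to exist as in the question):

from pyspark.sql.functions import when

# && replaced by &, and each comparison parenthesized
tdata.withColumn("Age",
                 when((tdata.Age == "") & (tdata.Survived == "0"), mean_age_0)
                 .otherwise(tdata.Age)).show()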
On a side note, the when function is equivalent to the SQL case expression, not the WHEN clause. Still, the same rules apply. Conjunction:
df.where((col("foo") > 0) & (col("bar") < 0))
Disjunction:
df.where((col("foo") > 0) | (col("bar") < 0))
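To make the side note above concrete, here is a sketch of the SQL CASE expression that such a when chain corresponds to (the DataFrame, view name, and labels are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, -1), (-1, 2)], ["foo", "bar"])
df.createOrReplaceTempView("df")

# when(...).otherwise(...) expresses the same logic as a CASE expression
spark.sql("""
    SELECT foo, bar,
           CASE WHEN foo > 0 AND bar < 0 THEN 'hit' ELSE 'miss' END AS label
    FROM df
""").show()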
You can of course define conditions separately to avoid brackets:
cond1 = col("Age") == ""
cond2 = col("Survived") == "0"
cond1 & cond2
Note: In PySpark it is important to enclose every expression within parentheses () when they combine to form the condition.
%pyspark
from pyspark.sql.functions import col, when

dataDF = spark.createDataFrame([(66, "a", "4"),
                                (67, "a", "0"),
                                (70, "b", "4"),
                                (71, "d", "4")],
                               ("id", "code", "amt"))

# Each comparison is parenthesized and combined with | or &
dataDF.withColumn("new_column",
                  when((col("code") == "a") | (col("code") == "d"), "A")
                  .when((col("code") == "b") & (col("amt") == "4"), "B")
                  .otherwise("A1")).show()
In Spark Scala code, (&&) or (||) conditions can be used within the when function:
//scala
// imports needed when not running in the Spark shell
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")

// && and || combine Column conditions directly in Scala
dataDF.withColumn("new_column",
  when(col("code") === "a" || col("code") === "d", "A")
  .when(col("code") === "b" && col("amt") === "4", "B")
  .otherwise("A1")).show()
=======================
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66| a| 4| A|
| 67| a| 0| A|
| 70| b| 4| B|
| 71| d| 4| A|
+---+----+---+----------+
This code snippet is copied from sparkbyexamples.com