 

How to use AND or OR condition in when in Spark

I wanted to evaluate two conditions in when, like this:

import pyspark.sql.functions as F

df = df.withColumn(
    'trueVal',
    F.when(df.value < 1 OR df.value2 == 'false', 0).otherwise(df.value)
)

For this I get an 'invalid syntax' error for using 'OR'.

I even tried using nested when statements:

df = df.withColumn(
    'v',
    F.when(df.value < 1, F.when(df.value =1, 0).otherwise(df.value)).otherwise(df.value)
)

For this I get a 'keyword can't be an expression' error for the nested when statements.

How can I use multiple conditions in when? Is there any workaround?

Kiran Bhagwat asked Nov 18 '16


People also ask

How do you write an if-else condition in PySpark?

PySpark When Otherwise – when() is a SQL function that returns a Column type, and otherwise() is a function of Column; if otherwise() is not used, unmatched rows get a None/NULL value. PySpark SQL Case When – this is similar to the SQL expression. Usage: CASE WHEN cond1 THEN result WHEN cond2 THEN result ... ELSE result END.
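As a minimal runnable sketch of the when()/otherwise() form (the toy data and column names here are my own, not from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(95,), (83,), (60,)], ['score'])

# chained when() clauses are evaluated top to bottom, like if/elif/else
df.withColumn(
    'grade',
    F.when(df.score >= 90, 'A')
     .when(df.score >= 80, 'B')
     .otherwise('C')   # drop otherwise() and unmatched rows get NULL
).show()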

Where vs filter Pyspark?

Both 'filter' and 'where' in Spark SQL give the same result; there is no difference between the two. filter is simply the standard Scala name for such a function, and where is for people who prefer SQL.
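A quick sketch of that equivalence (reusing the hypothetical score DataFrame from above):

# filter() and where() are aliases; all three calls select the same rows
df.filter(df.score >= 80).show()
df.where(df.score >= 80).show()
df.where('score >= 80').show()   # both also accept a SQL expression string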

How do I filter rows in Spark Dataframe?

Spark's filter() or where() function is used to filter rows from a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use the where() operator instead of filter() if you are coming from a SQL background. Both functions operate exactly the same.
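For example, a sketch of filtering on multiple conditions (hypothetical columns again):

# each comparison must be parenthesized before combining with & or |
df.filter((df.score >= 80) & (df.score < 90)).show()

# the same filter written as a single SQL expression string
df.filter('score >= 80 AND score < 90').show()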

Does Spark support if statement?

Like the SQL "case when" statement and the "switch" and "if then else" statements from popular programming languages, Spark SQL DataFrames also support similar syntax using "when otherwise" or a "case when" statement.
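The SQL-style form can be written through expr(); a sketch equivalent to the when()/otherwise() chain above (column names are hypothetical):

import pyspark.sql.functions as F

# CASE WHEN evaluates its conditions in order, like the chained when() version
df.withColumn(
    'grade',
    F.expr("CASE WHEN score >= 90 THEN 'A' WHEN score >= 80 THEN 'B' ELSE 'C' END")
).show()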


1 Answer

pyspark.sql.DataFrame.where takes a Boolean Column as its condition. When using PySpark, it's often useful to think "Column Expression" when you read "Column".

Logical operations on PySpark columns use the bitwise operators:

  • & for and
  • | for or
  • ~ for not

When combining these with comparison operators such as <, parentheses are often needed, because &, | and ~ bind more tightly than the comparison operators.
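For example (a sketch with hypothetical columns), each comparison gets its own parentheses:

# keep rows where value is small and value2 is not the string 'false'
df.where((df.value < 1) & ~(df.value2 == 'false'))
# without the parentheses, Python groups the & and ~ before the comparisons,
# e.g. df.value < (1 & ~df.value2) == 'false', which raises an error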

In your case, the correct statement is:

import pyspark.sql.functions as F

df = df.withColumn(
    'trueVal',
    F.when((df.value < 1) | (df.value2 == 'false'), 0).otherwise(df.value)
)
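A minimal runnable check of that statement (toy data; the column names match the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, 'true'), (2, 'false'), (3, 'true')],
    ['value', 'value2']
)

df.withColumn(
    'trueVal',
    F.when((df.value < 1) | (df.value2 == 'false'), 0).otherwise(df.value)
).show()
# the first two rows match a condition and get trueVal = 0; the last keeps value = 3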

See also: SPARK-8568

Daniel Shields answered Sep 22 '22