
Multiple conditions for filter in Spark data frames

I have a data frame with four fields. One of the fields is named Status, and I am trying to use an OR condition in .filter for a DataFrame. I tried the queries below, but no luck.

df2 = df1.filter(("Status=2") || ("Status =3"))

df2 = df1.filter("Status=2" || "Status =3")

Has anyone used this before? I have seen a similar question on Stack Overflow here. They used the code below for an OR condition, but that code is for PySpark.

from pyspark.sql.functions import col

numeric_filtered = df.where(
    (col('LOW')    != 'null') |
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()
asked Mar 09 '16 by dheee

People also ask

How do I apply multiple filters in Spark DataFrame?

Filter with multiple conditions: to filter() rows on a Spark DataFrame based on multiple conditions using AND (&&), OR (||), and NOT (!), you can use either a Column with a condition or a SQL expression, and extend a simple example with any combination of these operators.

How do you give multiple conditions in a filter in PySpark?

In PySpark, to filter() rows on a DataFrame based on multiple conditions, you can use either a Column with a condition or a SQL expression. A simple example uses the AND (&) condition; you can extend it with OR (|) and NOT (!) conditional expressions as needed.

How do I filter rows in Spark DataFrame?

DataFrame where() with Column condition: use a Column with the condition to filter rows from a DataFrame. With this approach you can express complex conditions by referring to column names as col(name), $"colname", or dfObject("colname"); it is the approach most used when working with DataFrames. Use "===" for comparison.

What is the function of filter () in Spark?

In Spark, the Filter function returns a new dataset formed by selecting those elements of the source on which the function returns true. So, it retrieves only the elements that satisfy the given condition.
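Putting the snippets above together, the OR filter from the question can be sketched as follows. This is a minimal spark-shell style sketch: it assumes `spark` is the SparkSession provided by the shell, and the toy data and column values are illustrative.

```scala
// Assumes a spark-shell session where `spark` is predefined
import org.apache.spark.sql.functions.col
import spark.implicits._

// Toy DataFrame with a Status column (illustrative data)
val df1 = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")).toDF("Status", "name")

// Column-based OR condition: note === for comparison and || for OR
val df2 = df1.filter(col("Status") === 2 || col("Status") === 3)

// Equivalent SQL-string condition
val df3 = df1.filter("Status = 2 or Status = 3")

df2.show()
```

Both forms keep only the rows where Status is 2 or 3; which one to use is mostly a matter of taste.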


3 Answers

Instead of:

df2 = df1.filter("Status=2" || "Status =3")

Try:

df2 = df1.filter($"Status" === 2 || $"Status" === 3)
answered Oct 01 '22 by David Griffin


This question has been answered, but for future reference I would like to mention that, in the context of this question, the where and filter methods on Dataset/DataFrame support two syntaxes. The SQL string parameters:

df2 = df1.filter("Status = 2 or Status = 3")

and Col based parameters (mentioned by @David ):

df2 = df1.filter($"Status" === 2 || $"Status" === 3)

It seems the OP combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.

answered Oct 01 '22 by Amin


In Spark/Scala, it's pretty easy to filter with varargs.

val d = spark.read... // data contains a column named matid
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
val filtered = d.filter($"matid".isin(ids:_*))
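The `ids: _*` part expands the Seq into varargs for isin. A self-contained sketch of the same idea on a toy DataFrame (spark-shell style, assuming `spark` is the session provided by the shell; the matid values are illustrative):

```scala
// Assumes a spark-shell session where `spark` is predefined
import spark.implicits._

// Toy DataFrame standing in for the real data source
val d = Seq("BNBEL0608AH", "BNBEL00608H", "OTHER").toDF("matid")
val ids = Seq("BNBEL0608AH", "BNBEL00608H")

// `ids: _*` expands the Seq into the varargs that isin expects
val filtered = d.filter($"matid".isin(ids: _*))
filtered.show()
```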
answered Oct 01 '22 by Tony Fraser