I have a DataFrame with four fields. One of the field names is Status, and I am trying to use an OR condition in .filter on the DataFrame. I tried the queries below, but no luck.
df2 = df1.filter(("Status=2") || ("Status =3"))
df2 = df1.filter("Status=2" || "Status =3")
Has anyone used this before? I have seen a similar question on Stack Overflow here. They used the code below for an OR condition, but that code is for PySpark.
from pyspark.sql.functions import col

numeric_filtered = df.where(
    (col('LOW') != 'null') |
    (col('NORMAL') != 'null') |
    (col('HIGH') != 'null'))
numeric_filtered.show()
Filter with Multiple Conditions: To filter() rows of a Spark DataFrame on multiple conditions using AND (&&), OR (||), and NOT (!), you can use either a Column with a condition or a SQL expression, as explained above. Below is a simple example; you can extend it with AND (&&), OR (||), and NOT (!).
In PySpark, to filter() rows of a DataFrame on multiple conditions, you can likewise use either a Column with a condition or a SQL expression. A simple example using the AND (&) condition can be extended with OR (|) and NOT (~) expressions as needed.
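For example, here is a minimal Scala sketch of the multi-condition pattern both paragraphs describe (df, Status, and Type are assumed example names; the PySpark snippet above shows the Python form):

import spark.implicits._ // for the $ syntax; spark is your SparkSession

// Combine conditions with && (AND), || (OR), and ! (NOT) on Column expressions
val andFiltered = df.filter($"Status" === 2 && $"Type" === "A")
val orFiltered  = df.filter($"Status" === 2 || $"Status" === 3)
val notFiltered = df.filter(!($"Status" === 2))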
DataFrame where() with Column condition: Use a Column with a condition to filter rows from a DataFrame. With this approach you can express complex conditions by referring to column names using col(name), $"colname", or dfObject("colname"); it is the style most often used when working with DataFrames. Use === for comparison.
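For instance, the three column references are interchangeable here (a sketch, assuming a DataFrame named df):

import org.apache.spark.sql.functions.col
import spark.implicits._ // required for the $"colname" form

// All three refer to the same column; === is the Column equality test
df.filter(col("Status") === 2)   // col(name)
df.filter($"Status" === 2)       // $"colname"
df.filter(df("Status") === 2)    // dfObject("colname")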
In Spark, the filter function returns a new dataset formed by selecting those elements of the source on which the function returns true; that is, it retains only the elements that satisfy the given condition.
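With a typed Dataset the function form is literal: filter takes a predicate and keeps the rows for which it returns true. A minimal sketch, assuming a hypothetical case class with a status field:

case class Record(status: Int)
val ds = Seq(Record(1), Record(2), Record(3)).toDS() // needs import spark.implicits._
// Keeps only the elements where the predicate returns true: Record(2), Record(3)
val kept = ds.filter(r => r.status == 2 || r.status == 3)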
Instead of:
df2 = df1.filter("Status=2" || "Status =3")
Try:
df2 = df1.filter($"Status" === 2 || $"Status" === 3)
This question has been answered, but for future reference I would like to mention that, in the context of this question, the where and filter methods on a Dataset/DataFrame support two syntaxes:
The SQL string parameters:
df2 = df1.filter(("Status = 2 or Status = 3"))
and Column-based parameters (mentioned by @David):
df2 = df1.filter($"Status" === 2 || $"Status" === 3)
It seems the OP combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.
In Spark/Scala, it's pretty easy to filter with varargs.

val d = spark.read... // data contains a column named matid
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
// ids: _* expands the Seq into the varargs that isin expects
val filtered = d.filter($"matid".isin(ids: _*))
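If you prefer the SQL string syntax, the same filter can be written as an IN expression (a sketch, not part of the original answer):

// SQL-expression equivalent of the isin filter above
val filteredSql = d.filter("matid in ('BNBEL0608AH', 'BNBEL00608H')")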