Can anyone explain why I am getting different results for these two expressions? I am trying to filter between two dates:
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'") \
    .select("col1", "col2").distinct().count()
Result: 37M
vs
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'") \
    .select("col1", "col2").distinct().count()
Result: 25M
How are they different? It seems to me like they should produce the same result.
TL;DR: To pass multiple conditions to filter or where, use Column objects and the logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.
from pyspark.sql.functions import col

# Parentheses around each comparison are required because & binds more tightly than >= and <=
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
You can also use a single SQL string:
df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")
In practice it makes more sense to use between:
df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
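For completeness, here is a minimal, self-contained sketch showing that all three correct forms keep the same rows. The local SparkSession and toy rows are assumptions for illustration; only the column names are taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy data standing in for the real table; column names match the question
df = spark.createDataFrame(
    [("2016-09-30", 1, "a"), ("2016-10-01", 2, "b"),
     ("2017-04-01", 3, "c"), ("2017-04-02", 4, "d")],
    ["act_date", "col1", "col2"],
)

# All three forms keep only rows with 2016-10-01 <= act_date <= 2017-04-01
c1 = df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01")).count()
c2 = df.filter("act_date >= '2016-10-01' AND act_date <= '2017-04-01'").count()
c3 = df.filter(col("act_date").between("2016-10-01", "2017-04-01")).count()
print(c1, c2, c3)  # 2 2 2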
The first approach is not even remotely valid. In Python, and returns the last operand if all operands are truthy, and otherwise the first falsey one. Since any non-empty string is truthy,
"act_date <='2017-04-01'" and "act_date >='2016-10-01'"
evaluates to just:
"act_date >='2016-10-01'"
In the first case,
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'") \
    .select("col1", "col2").distinct().count()
the result contains every row with act_date >= 2016-10-01, which also includes all the values after 2017-04-01.
Whereas in the second case,
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'") \
    .select("col1", "col2").distinct().count()
the result contains only the values between 2016-10-01 and 2017-04-01.
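A small reproduction on toy data shows the count difference directly; the rows and the local SparkSession below are made up for illustration, and only the column names come from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("2016-09-30", 1, "a"), ("2017-01-15", 2, "b"), ("2017-06-01", 3, "c")],
    ["act_date", "col1", "col2"],
)

# and on two strings returns only the second condition, so the upper bound is silently lost
broken = df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'") \
    .select("col1", "col2").distinct().count()

# Chained filters apply both conditions
chained = df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'") \
    .select("col1", "col2").distinct().count()

print(broken, chained)  # 2 1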