 

Multiple condition filter on dataframe

Can anyone explain to me why I am getting different results for these two expressions? I am trying to filter between two dates:

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'") \
  .select("col1", "col2").distinct().count()

Result : 37M

vs

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'") \
  .select("col1", "col2").distinct().count()

Result: 25M

How are they different? It seems to me like they should produce the same result.

femibyte asked Aug 31 '17



2 Answers

TL;DR To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.

from pyspark.sql.functions import col

df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

You can also use a single SQL string:

df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'") 

In practice it makes more sense to use between:

df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
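As an aside, comparing a date column against string literals works here only because ISO-8601 date strings sort lexicographically in the same order as the dates they represent. A quick plain-Python check of that property:

```python
# ISO-8601 date strings ("YYYY-MM-DD") compare correctly as plain strings:
# lexicographic order matches chronological order.
assert "2016-10-01" <= "2016-12-15" <= "2017-04-01"

# A date outside the range fails the bound check, as expected.
assert not ("2017-08-01" <= "2017-04-01")
```

This would not hold for other formats such as "DD-MM-YYYY", so the string-literal bounds are safe only with ISO-style dates.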

The first approach is not even remotely valid. In Python, and returns:

  • The last element if all expressions are "truthy".
  • The first "falsy" element otherwise.

As a result

"act_date <='2017-04-01'" and "act_date >='2016-10-01'" 

is evaluated to (any non-empty string is truthy):

"act_date >='2016-10-01'" 
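You can verify this behavior directly in plain Python, without Spark at all:

```python
# Python's `and` returns one of its operands, not a boolean.
a = "act_date <='2017-04-01'"
b = "act_date >='2016-10-01'"

# Both strings are non-empty, hence truthy, so `and` yields the last operand.
print(a and b)   # act_date >='2016-10-01'

# With a falsy operand (empty string), `and` yields that operand instead.
print(a and "")  # (empty string)
```

So the first snippet silently passes only the second condition string to filter.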
zero323 answered Sep 18 '22


In the first case

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'") \
  .select("col1", "col2").distinct().count()

only the second condition is applied (because of how Python's and evaluates strings), so the result contains every row with act_date >= '2016-10-01', including rows after 2017-04-01.

Whereas in the second case

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'") \
  .select("col1", "col2").distinct().count()

the result is only the values between 2016-10-01 and 2017-04-01.
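The difference in counts can be illustrated with a small plain-Python analogue (the sample dates below are hypothetical, standing in for the act_date column):

```python
# Hypothetical sample of ISO date strings standing in for the act_date column.
dates = ["2016-05-01", "2016-12-15", "2017-03-01", "2017-08-01"]

# Broken version: Python's `and` collapses the two condition strings into just
# the second one, so effectively only act_date >= '2016-10-01' is applied.
broken = [d for d in dates if d >= "2016-10-01"]

# Chained filters apply both bounds, like the second snippet in the question.
correct = [d for d in dates if "2016-10-01" <= d <= "2017-04-01"]

print(broken)   # ['2016-12-15', '2017-03-01', '2017-08-01']
print(correct)  # ['2016-12-15', '2017-03-01']
```

The broken version keeps '2017-08-01', which is why it returns a larger count (37M vs 25M in the question).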

Ash Man answered Sep 20 '22