
Filtering a pyspark dataframe using isin by exclusion [duplicate]

I am trying to get all rows of a dataframe where a column's value is not in a given list (that is, filtering by exclusion).

As an example:

df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')] ,schema=('id','bar')) 

I get the data frame:

+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+

I want to exclude only the rows where bar is 'a' or 'b'.

Using an SQL expression string it would be:

df.filter('bar not in ("a","b")').show() 

Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?

Edit:

I am likely to have a list, ['a','b'], of the excluded values that I would like to use.

gabrown86 asked Jan 21 '17 02:01




1 Answer

It looks like the ~ operator gives the functionality that I need, but I have yet to find any appropriate documentation on it.

from pyspark.sql.functions import col

df.filter(~col('bar').isin(['a','b'])).show()

+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
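On why ~ works here: Python's `not` keyword cannot be overloaded, but the unary ~ operator can, through the `__invert__` special method, and PySpark's Column uses that hook to express logical negation. Below is a minimal pure-Python sketch of the same idea, runnable without Spark. The `Predicate` class and the `isin` helper are hypothetical stand-ins made up for illustration; they are not PySpark's implementation, and the rows are plain dicts rather than a DataFrame.

```python
# Toy stand-in for a column expression; NOT PySpark's actual implementation.
class Predicate:
    """Wraps a function that maps a row (a dict) to True or False."""

    def __init__(self, fn):
        self.fn = fn

    def __invert__(self):
        # Python calls this method when it evaluates `~predicate`.
        return Predicate(lambda row: not self.fn(row))

    def __call__(self, row):
        return self.fn(row)


def isin(column, values):
    """Hypothetical helper: true when row[column] is one of `values`."""
    allowed = set(values)
    return Predicate(lambda row: row[column] in allowed)


# Same data as the question, held as plain dicts.
rows = [
    {"id": "1", "bar": "a"},
    {"id": "2", "bar": "b"},
    {"id": "3", "bar": "b"},
    {"id": "4", "bar": "c"},
    {"id": "5", "bar": "d"},
]
excluded = ["a", "b"]  # the exclusion list mentioned in the question's edit

keep = ~isin("bar", excluded)  # negated membership test
kept_rows = [row for row in rows if keep(row)]
print(kept_rows)
# [{'id': '4', 'bar': 'c'}, {'id': '5', 'bar': 'd'}]
```

The exclusion list is an ordinary Python list held in a variable, so the same pattern should carry over to the real call as `df.filter(~col('bar').isin(excluded))`.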
gabrown86 answered Oct 05 '22 00:10