I am trying to get all rows within a DataFrame where a column's value is not within a list (i.e. filtering by exclusion).
As an example:
df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')] ,schema=('id','bar'))
I get the data frame:
+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+
I want to exclude only the rows where bar is 'a' or 'b'.
Using an SQL expression string it would be:
df.filter('bar not in ("a","b")').show()
Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?
Edit:
I am likely to have a list, ['a','b'], of the excluded values that I would like to use.
PySpark's filter() function is used to filter rows from an RDD/DataFrame based on a given condition or SQL expression. You can also use where() instead of filter() if you are coming from a SQL background; the two functions operate identically.
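For example, here is a minimal sketch showing that the two calls are interchangeable. It re-creates the question's DataFrame using the modern SparkSession entry point (the builder/getOrCreate setup is an assumption; the question itself uses the older sqlContext):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

# filter() and where() are aliases: both accept a Column expression
# or a SQL expression string and return the same filtered DataFrame.
df.filter(df.bar == 'b').show()
df.where(df.bar == 'b').show()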
It looks like the ~ gives the functionality that I need, but I have yet to find any appropriate documentation for it.
from pyspark.sql.functions import col

df.filter(~col('bar').isin(['a','b'])).show()

+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
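The ~ here is PySpark's Column negation operator (Column.__invert__), i.e. a logical NOT applied to the boolean Column that isin() returns. Since the excluded values will come from a list anyway, here is a minimal sketch of the same pattern with the list factored out, plus an equivalent spelling without ~ (assuming the df and col import from above):

excluded = ['a', 'b']

# ~ negates the boolean Column produced by isin(), keeping only
# rows whose bar value is NOT in the excluded list.
df.filter(~col('bar').isin(excluded)).show()

# Equivalent spelling without ~: compare the boolean Column to False.
df.filter(col('bar').isin(excluded) == False).show()

Both calls return only the rows with bar equal to 'c' or 'd'.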