 

Pyspark dataframe operator "IS NOT IN"

Tags:

pyspark

I would like to rewrite this from R to Pyspark, any nice looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
Babu asked Oct 27 '16

People also ask

Is there a NOT IN function in PySpark?

In Spark, the isin() function checks whether a DataFrame column's value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator (~).

How do you use IS NOT NULL in PySpark?

Solution: To find non-null values in a PySpark DataFrame column, use the isNotNull() function, for example df.filter(df.name.isNotNull()). Similarly, to keep non-NaN values, negate isnan(): df.filter(~isnan(df.name)).

How do you use isNull() in PySpark?

In PySpark, the filter() or where() functions of DataFrame can filter rows with NULL values by checking isNull() on the PySpark Column class. For example, df.filter(df.state.isNull()) returns all rows that have a null value in the state column, as a new DataFrame.


2 Answers

In pyspark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)

Or using the binary NOT operator:

dataframe.filter(~dataframe.column.isin(array)) 
Ryan Widmaier answered Sep 28 '22


Use the ~ operator, which negates the condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3])) 
LaSul answered Sep 28 '22