I would like to rewrite this from R to PySpark. Any nice-looking suggestions?
array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
In Spark, the isin() function checks whether a DataFrame column value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator.
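For example, here is a minimal self-contained sketch (the DataFrame, column name, and values are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (5,)], ["column"])

# Keep only rows whose value is NOT in the given list
df.filter(~df["column"].isin([1, 2, 3])).show()  # leaves only the row with 5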
Solution: to find the non-null values of a PySpark DataFrame column, use the isNotNull() function, e.g. df.name.isNotNull(); similarly, for non-NaN values use ~isnan(df.name).
In PySpark, using the filter() or where() functions of DataFrame, we can filter rows with NULL values by checking isNull() on the PySpark Column class. Such a statement returns all rows that have null values in the given column (state in the original example), and the result comes back as a new DataFrame.
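As a rough, runnable sketch of both checks (the DataFrame and column names here are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 1.0), (None, float("nan")), ("Bob", 3.0)],
    ["name", "score"],
)

df.filter(df.name.isNull()).show()      # rows where name is null
df.filter(df.name.isNotNull()).show()   # rows where name is not null
df.filter(~isnan(df.score)).show()      # rows where score is not NaN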
In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or, more idiomatically, using the bitwise NOT operator ~:
dataframe.filter(~dataframe.column.isin(array))
Use the ~ operator, which negates the condition:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))