 

PySpark: filter DataFrame if column does not contain string

I hope this wasn't asked before; at least I couldn't find it. I'm trying to keep only the rows where the Key column does not contain the value 'sd'. Below is a working example of the opposite case, filtering for rows where it does contain it.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

values = [("sd123", "2"), ("kd123", "1")]
columns = ['Key', 'V1']
df2 = spark.createDataFrame(values, columns)

# keeps only the rows whose Key contains 'sd'
df2.where(F.col('Key').contains('sd')).show()

How do I do the opposite?

asked Dec 17 '20 at 08:12 by AlienDeg

People also ask

How do you check if a column contains a string in PySpark?

In Spark and PySpark, the contains() function is used to check whether a column value contains a literal string (it matches on part of the string); it is mostly used to filter rows in a DataFrame.
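For example, reusing df2 and the F alias from the code in the question (a minimal sketch):

# keep only rows whose Key contains the substring 'sd'
df2.filter(F.col('Key').contains('sd')).show()
# +-----+---+
# |  Key| V1|
# +-----+---+
# |sd123|  2|
# +-----+---+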

Is there an IS NOT IN condition in PySpark?

The PySpark IS NOT IN condition is used to exclude a defined set of values in a where() or filter() condition. In other words, it checks whether the DataFrame values do not exist in a given list of values.
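A short sketch, reusing df2 from above; the values in the list are illustrative. Note that isin() matches exact values, unlike the partial matching of contains():

# keep rows whose Key is NOT one of the exact values in the list
df2.filter(~F.col('Key').isin(['sd123', 'ab456'])).show()
# +-----+---+
# |  Key| V1|
# +-----+---+
# |kd123|  1|
# +-----+---+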

How do you use isNull() in PySpark?

In PySpark, using the filter() or where() functions of DataFrame, we can filter rows with NULL values by checking isNull() on the PySpark Column class. For example, a filter on a state column returns all rows that have a null state value, with the result returned as a new DataFrame.
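A minimal sketch, using a hypothetical df3 with a nullable state column (df3 is not part of the question's data; the column name follows the wording above):

# hypothetical DataFrame with a nullable 'state' column
df3 = spark.createDataFrame([('a', None), ('b', 'NY')], ['id', 'state'])

# keep only rows where state is NULL
df3.filter(F.col('state').isNull()).show()
# +---+-----+
# | id|state|
# +---+-----+
# |  a| null|
# +---+-----+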


1 Answer

Use ~ as the bitwise NOT operator to negate the condition:

df2.where(~F.col('Key').contains('sd')).show()
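With the sample data from the question, this keeps only the row whose Key does not contain 'sd' (on a PySpark Column, ~ invokes Column.__invert__, i.e. a logical NOT of the condition):

+-----+---+
|  Key| V1|
+-----+---+
|kd123|  1|
+-----+---+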
answered Nov 15 '22 at 00:11 by mck