I hope this wasn't asked before; at least I couldn't find it. I'm trying to keep only the rows where the Key column does not contain the value 'sd'. Below is a working example for the case when it does contain it.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

values = [("sd123", "2"), ("kd123", "1")]
columns = ['Key', 'V1']
df2 = spark.createDataFrame(values, columns)
df2.where(F.col('Key').contains('sd')).show()
How do I do the opposite?
In Spark and PySpark, the contains() function matches when a column value contains the given literal string (i.e. it matches on part of the string); it is most commonly used to filter rows of a DataFrame.
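For instance, contains() matches anywhere in the string, not only at the start; a minimal sketch using the df2 defined above:

# 'd12' occurs in the middle of both 'sd123' and 'kd123', so both rows match
df2.where(F.col('Key').contains('d12')).show()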
PySpark's "IS NOT IN" condition is used inside a where() or filter() to exclude rows whose column value matches any of several defined values. In other words, it keeps the rows whose values do not appear in a given list, as in the sketch below.
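A minimal sketch, reusing the df2 from the question (the second literal 'xd999' is just a hypothetical placeholder), combines the negation operator ~ with isin():

# keep rows whose Key is NOT one of the listed literal values
df2.where(~F.col('Key').isin('sd123', 'xd999')).show()
# only the kd123 row remains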
Use ~ as bitwise NOT:
df2.where(~F.col('Key').contains('sd')).show()
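Since ~ negates the whole Column expression, the same pattern should work with other string predicates too, e.g. (assuming the same df2 as above):

# equivalent exclusions with other string matchers
df2.where(~F.col('Key').startswith('sd')).show()   # Key does not start with 'sd'
df2.where(~F.col('Key').like('%sd%')).show()       # Key does not match the LIKE pattern

Both return only the kd123 row here.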