Suppose we have a simple dataframe:
from pyspark.sql.types import *

schema = StructType([
    StructField('id', LongType(), False),
    StructField('name', StringType(), False),
    StructField('count', LongType(), True),
])

df = spark.createDataFrame([(1, 'Alice', None), (2, 'Bob', 1)], schema)
The question is: how do I detect null values? I tried the following:
df.where(df.count == None).show()
df.where(df.count is 'null').show()
df.where(df.count == 'null').show()
All of these result in the error:
condition should be string or Column
I know the following works:
df.where("count is null").show()
But is there a way to achieve this without using the full string, i.e. with df.count...?
You can use the Spark SQL function isnull:
from pyspark.sql import functions as F
df.where(F.isnull(F.col("count"))).show()
or directly with the Column method isNull:
df.where(F.col("count").isNull()).show()
Another way of doing the same is by using the filter API:
from pyspark.sql import functions as F
df.filter(F.isnull("count")).show()