How to filter null values in a PySpark DataFrame?

Suppose we have a simple DataFrame:

from pyspark.sql.types import *

schema = StructType([
    StructField('id', LongType(), False),
    StructField('name', StringType(), False),
    StructField('count', LongType(), True),
])
df = spark.createDataFrame([(1,'Alice',None), (2,'Bob',1)], schema)

How can I detect the null values? I tried the following:

df.where(df.count == None).show()
df.where(df.count is 'null').show()
df.where(df.count == 'null').show()

Each attempt results in the error:

condition should be string or Column

I know the following works:

df.where("count is null").show()

But is there a way to achieve this without writing the full string, i.e. with df.count...?

asked Dec 28 '17 by Miroslav Stola

People also ask

How do I filter NULL values in PySpark?

In PySpark, you can filter rows with NULL values using the filter() or where() methods of DataFrame together with the isNull() method of the Column class. This returns all rows that have null values in the given column as a new DataFrame.
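For example, a minimal sketch using the df defined in the question:

from pyspark.sql import functions as F

# keep only the rows where the 'count' column is null
df.where(F.col('count').isNull()).show()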

How do you filter NULL values in Python?

You can filter out rows with NaN values from a pandas DataFrame column (string, float, datetime, etc.) by using DataFrame.dropna() or boolean indexing with notnull().
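A rough pandas sketch; the pdf DataFrame here is made up for illustration:

import pandas as pd

pdf = pd.DataFrame({'name': ['Alice', 'Bob'], 'count': [None, 1]})

# drop rows where 'count' is NaN
print(pdf.dropna(subset=['count']))

# equivalent boolean indexing with notnull()
print(pdf[pdf['count'].notnull()])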

IS NULL condition in PySpark?

The isNull() method checks whether the current expression is NULL/None; it evaluates to True for rows where the column contains a NULL/None value.
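For instance, a small sketch that surfaces the boolean result per row for the df from the question:

from pyspark.sql import functions as F

# add a boolean column that is True where 'count' is null
df.select('id', 'count', F.col('count').isNull().alias('count_is_null')).show()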

How do you check NOT NULL values in PySpark?

Solution: In order to find the non-null values of a PySpark DataFrame column, use the isNotNull() method, for example df.name.isNotNull(); similarly, for non-NaN values use ~isnan(df.name).
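A short sketch of both checks; the 'value' column in the commented line is hypothetical, since isnan() only applies to float/double columns:

from pyspark.sql.functions import col, isnan

# rows where 'name' is not null
df.where(df.name.isNotNull()).show()

# for a hypothetical float/double column 'value', exclude both null and NaN:
# df.where(col('value').isNotNull() & ~isnan(col('value'))).show()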


2 Answers

You can use the Spark SQL function isnull:

from pyspark.sql import functions as F
df.where(F.isnull(F.col("count"))).show()

or directly with the Column method isNull:

df.where(F.col("count").isNull()).show()
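Note that the attribute syntax from the question (df.count) cannot work here: count is also the name of the DataFrame method df.count(), so the attribute lookup returns the bound method rather than the Column. Bracket indexing avoids the clash:

df.where(df["count"].isNull()).show()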
answered Oct 24 '22 by Steven

Another way to do the same thing is with the filter API:

from pyspark.sql import functions as F
df.filter(F.isnull("count")).show()
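With the sample df from the question, all of these approaches should print something like:

+---+-----+-----+
| id| name|count|
+---+-----+-----+
|  1|Alice| null|
+---+-----+-----+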
answered Oct 24 '22 by Ramesh Maharjan