How to return rows with Null values in pyspark dataframe?

I am trying to get the rows with null values from a pyspark dataframe. In pandas, I can achieve this using isnull() on the dataframe:

df = df[df.isnull().any(axis=1)]

But in PySpark, when I run the command below, it raises an AttributeError:

df.filter(df.isNull())

AttributeError: 'DataFrame' object has no attribute 'isNull'.

How can I get the rows with null values without checking each column individually?

asked Nov 26 '18 by dg S

People also ask

How do you drop rows with null values in Spark DataFrame?

To remove rows with NULL values in selected columns of a Spark DataFrame, use drop(columns:Seq[String]) or drop(columns:Array[String]). Pass these functions the names of the columns you want to check for NULL values; rows containing NULLs in those columns are deleted.

How do I assign a null value in PySpark?

To replace an empty value with None/null in a single DataFrame column, you can use withColumn() with when().otherwise().

How do I find null values in a column in PySpark?

In a PySpark DataFrame you can count the Null, None, NaN, or empty/blank values in a column by combining isNull() from the Column class with the SQL functions isnan(), count(), and when().

IS NULL function in PySpark?

The isNull() function checks whether the current expression is NULL/None, or whether a column contains a NULL/None value; if so, it returns the boolean value True.


1 Answer

You can filter the rows with where, reduce, and a generator expression. For example, given the following dataframe:

df = sc.parallelize([
    (0.4, 0.3),
    (None, 0.11),
    (9.7, None), 
    (None, None)
]).toDF(["A", "B"])

df.show()
+----+----+
|   A|   B|
+----+----+
| 0.4| 0.3|
|null|0.11|
| 9.7|null|
|null|null|
+----+----+

Filtering the rows with some null value could be achieved with:

import pyspark.sql.functions as f
from functools import reduce

df.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df.columns))).show()

Which gives:

+----+----+
|   A|   B|
+----+----+
|null|0.11|
| 9.7|null|
|null|null|
+----+----+

In the condition you choose how to combine the per-column checks: | (or) keeps rows where any column is null, while & (and) keeps rows where all columns are null.

answered Oct 21 '22 by Amanda