I am trying to get the rows with null values from a pyspark dataframe. In pandas, I can achieve this using isnull()
on the dataframe:
df = df[df.isnull().any(axis=1)]
But in the case of PySpark, when I run the command below it throws an AttributeError:
df.filter(df.isNull())
AttributeError: 'DataFrame' object has no attribute 'isNull'.
How can I get the rows with null values without checking each column individually?
You can filter the rows with where, reduce and a list comprehension. For example, given the following DataFrame:
df = sc.parallelize([
    (0.4, 0.3),
    (None, 0.11),
    (9.7, None),
    (None, None)
]).toDF(["A", "B"])
df.show()
+----+----+
|   A|   B|
+----+----+
| 0.4| 0.3|
|null|0.11|
| 9.7|null|
|null|null|
+----+----+
Filtering the rows with some null value could be achieved with:
import pyspark.sql.functions as f
from functools import reduce
df.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df.columns))).show()
Which gives:
+----+----+
|   A|   B|
+----+----+
|null|0.11|
| 9.7|null|
|null|null|
+----+----+
In the condition you choose the logic you need: chaining the checks with | (or) keeps rows where any column is null, while & (and) keeps rows where all columns are null.
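For completeness, here is a minimal sketch of the all-columns variant, assuming the same df as above; only the operator changes from | to &:
import pyspark.sql.functions as f
from functools import reduce

# keep only the rows where every column is null
df.where(reduce(lambda x, y: x & y, (f.col(c).isNull() for c in df.columns))).show()
With the sample data this returns only the last row:
+----+----+
|   A|   B|
+----+----+
|null|null|
+----+----+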