I want to count the number of nulls in a DataFrame by row.
Please note there are 50+ columns. I know I could do this with a case/when statement (sketched below, after the example), but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None), (2, None, 1, None), (3, None, 9, 1)]
df = spark.createDataFrame(vals, columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    0| null|
|  2| null|    1| null|
|  3| null|    9|    1|
+---+-----+-----+-----+
After running the code, the desired output is:
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
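For reference, the case/when approach mentioned above would look roughly like this. It is just a sketch of the verbose version, with one when/otherwise per column, which clearly does not scale well to 50+ columns:

from pyspark.sql import functions as F

# One when/otherwise per column, added together by hand --
# only the three item columns are shown here.
numNulls = (
    F.when(F.col('item1').isNull(), 1).otherwise(0)
    + F.when(F.col('item2').isNull(), 1).otherwise(0)
    + F.when(F.col('item3').isNull(), 1).otherwise(0)
)
df.withColumn('numNulls', numNulls).show()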
EDIT: In the full data, not all non-null values are ints.
Convert null to 1 and others to 0, then sum across all the columns (note that sum here is Python's builtin sum adding the Column expressions together, not pyspark.sql.functions.sum):
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    0| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    1|       1|
+---+-----+-----+-----+--------+
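If you prefer not to rely on Python's builtin sum, or you want to leave certain columns out of the count, you can build the same expression with functools.reduce. This is just a sketch of the same technique; excluding the 'id' column is an assumption here, so adjust cols_to_check for your own data.

import operator
from functools import reduce
from pyspark.sql import functions as F

# Same idea: cast each "is null" flag to int and add the flags up.
# Assumption: the 'id' column should not be counted -- adjust as needed.
cols_to_check = [c for c in df.columns if c != 'id']
null_flags = [F.col(c).isNull().cast('int') for c in cols_to_check]
df.withColumn('numNulls', reduce(operator.add, null_flags)).show()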