 

Counting number of nulls in pyspark dataframe by row

So I want to count the number of nulls in a dataframe by row.

Please note, there are 50+ columns, I know I could do a case/when statement to do this, but I would prefer a neater solution.

For example, a subset:

columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, '2', 'A', None), (2, None, '1', None), (3, None, '9', 'C')]
df = spark.createDataFrame(vals, columns)
df.show()

+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    A| null|
|  2| null|    1| null|
|  3| null|    9|    C|
+---+-----+-----+-----+

After running the code, the desired output is:

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

EDIT: Not all non-null values are ints.

asked Oct 17 '18 by tormond



1 Answer

Convert each null to 1 and everything else to 0, then sum across all the columns. Python's built-in sum works here because adding Spark Column objects yields a new Column expression, and isNull() applies to any column type, so the mixed int/string columns are handled:

df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+
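As a quick sanity check outside Spark, the same row-wise logic can be run in plain Python on the sample rows as displayed in the question (here `None` stands in for null; the string values are an assumption matching the show() output):

```python
# Sample rows mirroring the question's displayed table; None represents a null cell
rows = [
    (1, '2', 'A', None),
    (2, None, '1', None),
    (3, None, '9', 'C'),
]

# Count None values per row, mirroring isNull().cast('int') summed across columns
num_nulls = [sum(v is None for v in row) for row in rows]
print(num_nulls)  # [1, 2, 1]
```

This matches the numNulls column in the desired output above.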
answered Sep 29 '22 by Psidom