 

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np

data = [
    (1, 1, None),
    (1, 2, float(5)),
    (1, 3, np.nan),
    (1, 4, None),
    (1, 5, float(10)),
    (1, 6, float("nan")),
    (1, 6, float("nan")),
]

df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))

Expected output

A dataframe with the count of NaN/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null and not NaN. That's why I have created a new question.

I know I can use the isnull() function in Spark to find the number of Null values in a Spark column, but how do I find NaN values in a Spark dataframe?

asked Jun 19 '17 by GeorgeOfTheRF


People also ask

How do you count values in a column in PySpark DataFrame?

In PySpark, there are two ways to get the count of distinct values. We can chain the distinct() and count() functions of a DataFrame to get the distinct row count. Another way is to use the SQL countDistinct() function, which returns the distinct value count of the selected columns.
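
A minimal sketch of both approaches, assuming a running SparkSession and the df from the question:

from pyspark.sql.functions import countDistinct

# Distinct rows across all columns, then count them (an action)
df.distinct().count()

# Distinct values of a single column, e.g. id2
df.select(countDistinct("id2")).show()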

How do you display NULL values in PySpark DataFrame?

Filter Rows with NULL Values in DataFrame: In PySpark, we can filter rows with NULL values using the filter() or where() functions of a DataFrame together with the isNull() method of the Column class. Such a filter returns all rows that have null values in the given column (e.g. a state column), and the result is returned as a new DataFrame.
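
For illustration, a sketch of that pattern using the id2 column from the question's dataframe (the state column above is just the example from that answer):

from pyspark.sql.functions import col

# Keep only the rows where id2 is NULL
df.filter(col("id2").isNull()).show()

# Equivalent formulation with where()
df.where(df.id2.isNull()).show()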

Does PySpark count include NULL?

The count of null values in a PySpark dataframe is obtained using the isNull() function. The count of NaN (missing) values is obtained using the isnan() function.
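
In code, those two counts look roughly like this (a sketch assuming the df from the question):

from pyspark.sql.functions import isnan

# Number of rows where id2 is null
df.filter(df.id2.isNull()).count()   # 2 for the question's data

# Number of rows where id2 is NaN
df.filter(isnan(df.id2)).count()     # 3 for the question's data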

How does count work in PySpark?

The count() function counts the rows and returns the result to the driver, which makes it an action in PySpark. It is used to count the number of rows present in the dataframe before or after data analysis.
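
For example, count() triggers a job and returns a plain Python integer to the driver:

# count() is an action: it executes the plan and returns the row count
n = df.count()
print(n)   # 7 for the dataframe in the question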


2 Answers

You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
answered Sep 21 '22 by user8183279


To make sure it does not fail for string, date and timestamp columns:

import pyspark.sql.functions as F


def count_missings(spark_df, sort=True):
    """
    Counts the number of nulls and NaNs in each column.
    """
    df = spark_df.select([
        F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
        for (c, c_type) in spark_df.dtypes
        if c_type not in ('timestamp', 'string', 'date')
    ]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

    return df

If you want to see the columns sorted by the number of NaNs and nulls in descending order:

count_missings(spark_df)

# | Col_A | 10 |
# | Col_C | 2  |
# | Col_B | 1  |

If you don't want ordering and want to see them as a single row:

count_missings(spark_df, False)

# | Col_A | Col_B | Col_C |
# |  10   |   1   |   2   |
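
Applied to the dataframe from the question, a sorted call would look roughly like this (all three columns are numeric, so none is filtered out; the output is sketched from the counts shown in the first answer):

count_missings(df)

# |    id2     | 5 |
# |  session   | 0 |
# | timestamp1 | 0 |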
answered Sep 21 '22 by gench