 

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np

data = [
    (1, 1, None),
    (1, 2, float(5)),
    (1, 3, np.nan),
    (1, 4, None),
    (1, 5, float(10)),
    (1, 6, float("nan")),
    (1, 6, float("nan")),
]

df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))

Expected output

A dataframe with the count of NaN/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null and not NaN. That's why I have created a new question.

I know I can use the isnull() function in Spark to find the number of Null values in a Spark column, but how do I find NaN values in a Spark dataframe?

asked Jun 19 '17 by GeorgeOfTheRF


People also ask

How do you count values in a column in PySpark DataFrame?

In PySpark, there are two ways to get the count of distinct values. We can chain the distinct() and count() functions of a DataFrame to get the distinct row count. Another way is to use the SQL countDistinct() function, which returns the distinct value count of the selected columns.
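
A minimal sketch of both approaches, assuming a running SparkSession and the df from the question:

from pyspark.sql.functions import countDistinct

# Distinct rows across all columns, then count them (an action)
df.distinct().count()

# Distinct values of a single column, e.g. id2
df.select(countDistinct("id2")).show()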

How do you display NULL values in PySpark DataFrame?

Filter Rows with NULL Values in DataFrame: In PySpark, we can filter rows with NULL values using the filter() or where() functions of a DataFrame together with the isNull() method of the Column class. Such a filter returns all rows that have null values in the given column (e.g. a state column), and the result is returned as a new DataFrame.
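
For illustration, a sketch of that pattern using the id2 column from the question's dataframe (the state column above is just the example from that answer):

from pyspark.sql.functions import col

# Keep only the rows where id2 is NULL
df.filter(col("id2").isNull()).show()

# Equivalent formulation with where()
df.where(df.id2.isNull()).show()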

Does PySpark count include NULL?

The count of null values in a PySpark dataframe is obtained using the isNull() function. The count of NaN (missing) values is obtained using the isnan() function.
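
In code, those two counts look roughly like this (a sketch assuming the df from the question):

from pyspark.sql.functions import isnan

# Number of rows where id2 is null
df.filter(df.id2.isNull()).count()   # 2 for the question's data

# Number of rows where id2 is NaN
df.filter(isnan(df.id2)).count()     # 3 for the question's data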

How does count work in PySpark?

The count() function counts the rows and returns the result to the driver, which makes it an action in PySpark. It is used to count the number of rows present in the dataframe before or after data analysis.
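
For example, count() triggers a job and returns a plain Python integer to the driver:

# count() is an action: it executes the plan and returns the row count
n = df.count()
print(n)   # 7 for the dataframe in the question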


2 Answers

You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
answered Sep 21 '22 by user8183279


To make sure it does not fail for string, date and timestamp columns:

import pyspark.sql.functions as F


def count_missings(spark_df, sort=True):
    """
    Counts the number of nulls and NaNs in each column.
    """
    df = spark_df.select([
        F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
        for (c, c_type) in spark_df.dtypes
        if c_type not in ('timestamp', 'string', 'date')
    ]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

    return df

If you want to see the columns sorted by the number of NaNs and nulls in descending order:

count_missings(spark_df)

# | Col_A | 10 |
# | Col_C | 2  |
# | Col_B | 1  |

If you don't want ordering and want to see them as a single row:

count_missings(spark_df, False)

# | Col_A | Col_B | Col_C |
# |  10   |   1   |   2   |
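
Applied to the dataframe from the question, a sorted call would look roughly like this (all three columns are numeric, so none is filtered out; the output is sketched from the counts shown in the first answer):

count_missings(df)

# |    id2     | 5 |
# |  session   | 0 |
# | timestamp1 | 0 |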
answered Sep 21 '22 by gench