import numpy as np

data = [
    (1, 1, None),
    (1, 2, float(5)),
    (1, 3, np.nan),
    (1, 4, None),
    (1, 5, float(10)),
    (1, 6, float("nan")),
    (1, 6, float("nan")),
]

df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output
A dataframe with the count of NaN/null values for each column.
Note: The previous questions I found on Stack Overflow only check for null, not NaN. That's why I have created a new question.
I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?
In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() functions of DataFrame to get the distinct count of a PySpark DataFrame. Another way is to use the SQL countDistinct() function, which returns the distinct value count of the selected columns. A minimal sketch of both approaches follows (using the df defined in the question; the column choices are just examples):
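from pyspark.sql.functions import countDistinct

# DataFrame API: drop duplicate rows over the selected columns, then count
distinct_rows = df.select("session", "timestamp1").distinct().count()

# SQL-style aggregate: distinct count of a single column
df.select(countDistinct("id2").alias("distinct_id2")).show()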
Filter Rows with NULL Values in DataFrame: In PySpark, using the filter() or where() functions of DataFrame, we can filter rows with NULL values by checking isNull() of the PySpark Column class. Such statements return all rows that have null values in the checked column, and the result is returned as a new DataFrame. A short sketch against the question's df (the original snippet refers to a state column from a different example; id2 is used here instead):
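from pyspark.sql.functions import col

# Rows where id2 is NULL (None); note that NaN is a valid double and is NOT matched by isNull()
df.filter(col("id2").isNull()).show()

# Equivalent, using where() and attribute-style column access
df.where(df.id2.isNull()).show()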
Count of null values of a dataframe in PySpark is obtained using the isNull() function. Count of missing (NaN) values of a dataframe in PySpark is obtained using the isnan() function.
The count() function counts the rows and returns the result to the driver, making it an action in PySpark. It is used to count the number of rows present in the data frame before or after data analysis. A quick sketch with the df from the question:
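# count() triggers a job and returns a plain Python int on the driver
total_rows = df.count()
print(total_rows)  # 7 rows in the sample data above, including the duplicate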
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F

def count_missings(spark_df, sort=True):
    """Counts number of nulls and nans in each column"""
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
                          for (c, c_type) in spark_df.dtypes
                          if c_type not in ('timestamp', 'string', 'date')]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

    return df
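If you also need null counts for the skipped string, date and timestamp columns (where isnan() is not defined), one possible variant, sketched here as an assumption rather than part of the original answer (the helper name count_missings_all is made up for this sketch), applies isnan() only where the column type allows it:

import pyspark.sql.functions as F

def count_missings_all(spark_df, sort=True):
    """Count nulls in every column; add NaN counts where isnan() applies."""
    exprs = []
    for c, c_type in spark_df.dtypes:
        if c_type in ('timestamp', 'string', 'date'):
            # isnan() is undefined for these types, so count only nulls
            exprs.append(F.count(F.when(F.isnull(c), c)).alias(c))
        else:
            # numeric (e.g. double) columns can hold both null and NaN
            exprs.append(F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c))
    result = spark_df.select(exprs).toPandas()
    if sort:
        return result.rename(index={0: 'count'}).T.sort_values("count", ascending=False)
    return result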
If you want to see the columns sorted by the number of NaNs and nulls in descending order:
count_missings(spark_df)

# | Col_A | 10 |
# | Col_C |  2 |
# | Col_B |  1 |
If you don't want ordering and would rather see them as a single row:
count_missings(spark_df, False)

# | Col_A | Col_B | Col_C |
# |    10 |     1 |     2 |