I have a dataset with missing values, and I would like to get the number of missing values for each column. Here is what I did; it gives me the number of non-missing values. How can I use it to get the number of missing values?
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+
Any help to get a DataFrame listing each column with its number of missing values would be appreciated.
You could count the missing values by summing the boolean output of the isNull()
method, after converting it to type integer:
In Scala:
import org.apache.spark.sql.functions.{sum, col}
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col,sum
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
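For example, on a small hypothetical DataFrame (the column names and values below are only for illustration), this produces a single row with the null count per column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: None marks a missing value
df = spark.createDataFrame(
    [(1, None, None), (None, 2.0, None), (3, 3.0, "c")],
    ["x", "y", "z"],
)

# Sum the 0/1 flags produced by isNull() to count missing values per column
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()
# Expected output (one row of null counts per column):
# +---+---+---+
# |  x|  y|  z|
# +---+---+---+
# |  1|  1|  2|
# +---+---+---+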
Alternatively, you could also use the output of df.describe().filter($"summary" === "count"), and subtract the number in each cell from the total number of rows in the data:
In Scala:
import org.apache.spark.sql.functions.{col, lit}
val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
summary.select(df.columns.map(c =>(lit(rows) - col(c)).alias(c)): _*).show
In Python:
from pyspark.sql.functions import col, lit
rows = df.count()
summary = df.describe().filter(col("summary") == "count")
summary.select(*((lit(rows)-col(c)).alias(c) for c in df.columns)).show()
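One detail worth noting: df.describe() returns every column of the summary DataFrame as a string, so the subtraction above relies on Spark's implicit cast. If you prefer to make the types explicit, a minimal variation of the Python version (same df and summary as above) is:
from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")
# Cast the string counts to long before subtracting them from the total row count
summary.select(*((lit(rows) - col(c).cast("long")).alias(c) for c in df.columns)).show()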
# Count nulls per column and collect the single-row result as a pandas DataFrame
from pyspark.sql.functions import isnull, when, count, col
nacounts = df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).toPandas()
nacounts
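The isnull() check above does not flag NaN values in float or double columns. If those should also count as missing, a common variation is to test for both; the sketch below is restricted to float/double columns since isnan() is only defined for those types (the numeric_cols helper is just for illustration):
from pyspark.sql.functions import col, count, isnan, when

# Columns where NaN is possible (float and double types)
numeric_cols = [f.name for f in df.schema.fields if f.dataType.typeName() in ("float", "double")]

# Count rows that are either NULL or NaN in each numeric column
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in numeric_cols]).show()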