 

Count the number of missing values in a Spark DataFrame

I have a dataset with missing values, and I would like to get the number of missing values for each column. The following is what I did; it gives me the number of non-missing values. How can I use it to get the number of missing values?

df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+

I'd appreciate any help getting a DataFrame that lists each column with its number of missing values.

asked Jun 07 '17 by Maher HTB

People also ask

How do you get null count in PySpark?

The count of null values in a PySpark DataFrame is obtained using the isNull() function, and the count of NaN values using the isnan() function.
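For instance, a minimal sketch of both counts (spark is assumed to be an existing SparkSession, and the tiny example DataFrame is made up):

from pyspark.sql.functions import col, count, isnan, when

df = spark.createDataFrame([(1.0, None), (float("nan"), 2.0)], ["x", "y"])

# Nulls per column, via isNull()
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# NaNs per column, via isnan()
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()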

How do I count distinct values in Spark DataFrame?

In PySpark there are two ways to get the count of distinct values: you can use the distinct() and count() methods of DataFrame, or the SQL function countDistinct(), which returns the distinct value count of the selected columns.
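A brief sketch of the countDistinct() route (the DataFrame df and the column name "x" are assumed for illustration):

from pyspark.sql.functions import countDistinct

# countDistinct() returns the number of distinct values in the selected column(s)
df.select(countDistinct("x").alias("distinct_x")).show()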

How do you identify missing values and deal with missing values in DataFrame?

To check for missing values in a Pandas DataFrame, use the isnull() and notnull() functions. Both check whether a value is NaN or not, and both can also be used on a Pandas Series to find null values.
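For example, a small sketch with made-up data:

import numpy as np
import pandas as pd

pdf = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, np.nan]})

# isnull() marks missing values as True; summing the mask counts them per column
print(pdf.isnull().sum())

# notnull() is the complement: True wherever a value is present
print(pdf.notnull().sum())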

How do you use count in PySpark?

In PySpark, you can use distinct().count() on a DataFrame, or the countDistinct() SQL function, to get the distinct count. distinct() eliminates duplicate records (rows matching on all columns) from the DataFrame, and count() then returns the number of remaining records.
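A minimal sketch of the DataFrame route (df is assumed to exist):

# distinct() drops rows that duplicate another row across all columns;
# count() then returns how many unique rows remain
n_unique_rows = df.distinct().count()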


2 Answers

You could count the missing values by summing the boolean output of the isNull() method, after casting it to integer type:

In Scala:

import org.apache.spark.sql.functions.{sum, col}
// For each column, cast the isNull flag to 0/1 and sum to get the null count
df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, sum
# Same idea in Python: sum the 0/1 null indicator for each column
df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

Alternatively, you could use the output of df.describe().filter($"summary" === "count") and subtract the non-null count in each cell from the total number of rows in the data:

In Scala:

import org.apache.spark.sql.functions.{lit, col}

val rows = df.count()
val summary = df.describe().filter($"summary" === "count")
// Missing per column = total rows minus the non-null count reported by describe()
summary.select(df.columns.map(c => (lit(rows) - col(c)).alias(c)): _*).show

In Python:

from pyspark.sql.functions import col, lit

rows = df.count()
summary = df.describe().filter(col("summary") == "count")
# Missing per column = total rows minus the non-null count reported by describe()
summary.select(*((lit(rows) - col(c)).alias(c) for c in df.columns)).show()
answered Oct 19 '22 by mtoto

from pyspark.sql.functions import isnull, when, count, col
# Count the nulls in each column and pull the one-row result into pandas
nacounts = df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).toPandas()
nacounts
answered Oct 19 '22 by Harish Vasudev