Count the number of non-null values in a Spark DataFrame

Tags: null, count, scala, apache-spark, apache-spark-sql

I have a data frame with some columns, and before doing analysis, I'd like to understand how complete the data frame is. So I want to count, for each column, the number of non-null values, ideally getting the result back as a DataFrame.

Basically, I am trying to achieve the same result as expressed in this question but using Scala instead of Python.

Say you have:

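// Assumes spark-shell, where spark.implicits._ is already imported (needed for toDF)
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)), (None, Some(3), Some(4)), (None, None, Some(5))
).toDF("x", "y", "z")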

How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?

asked Jan 20 '17 14:01

user299791

3 Answers

One straightforward option is to use the .describe() function to get a summary of your data frame, where the count row includes a count of non-null values:

df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
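
Note that describe() returns its values as strings and adds a summary column. If you want a single row of numeric counts with exactly the original columns, one possible sketch is to aggregate count over every column. The helper name nonNullCounts, the local SparkSession setup and the sample data are assumptions made for illustration, not part of the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("non-null-counts").getOrCreate()
import spark.implicits._

// Illustrative data mirroring the question: x has 1 non-null value, y has 2, z has 3.
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)),
  (None, Some(3), Some(4)),
  (None, None, Some(5))
).toDF("x", "y", "z")

// count(column) skips nulls, so one count per column gives the non-null totals.
// Hypothetical helper; assumes at least one column and column names without dots.
def nonNullCounts(df: DataFrame): DataFrame = {
  val exprs = df.columns.map(c => count(col(c)).as(c))
  df.agg(exprs.head, exprs.tail: _*)
}

val counts = nonNullCounts(df)
counts.show()
```

The result has one row and the same column names as the input, with each count typed as a long rather than a string.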

answered Oct 16 '22 21:10

Psidom

Although I like Psidom's answer, often I'm more interested in the fraction of null values, because the number of non-null values alone doesn't tell you much.

You can do something like:

import org.apache.spark.sql.functions.{sum,when, count}

df.agg(
   (sum(when($"x".isNotNull,0).otherwise(1))/count("*")).as("x : fraction null"),
   (sum(when($"y".isNotNull,0).otherwise(1))/count("*")).as("y : fraction null"),
   (sum(when($"z".isNotNull,0).otherwise(1))/count("*")).as("z : fraction null")
 ).show()

EDIT: sum(when($"x".isNotNull,0).otherwise(1)) counts the null values. Since count($"x") counts only non-null values, the same number can also be written as count("*") - count($"x"). As I find this not obvious, I tend to use the sum notation, which is clearer.
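
To avoid repeating the expression for every column, the same idea can probably be generalized by mapping over df.columns. This is only a sketch: the helper name nullFractions, the local SparkSession and the sample data are assumptions for illustration, and the output columns keep the input names rather than the "x : fraction null" aliases used above:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-fractions").getOrCreate()
import spark.implicits._

// Illustrative data: x is null in 2 of 3 rows, y in 1 of 3, z in none.
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)),
  (None, Some(3), Some(4)),
  (None, None, Some(5))
).toDF("x", "y", "z")

// For each column: (number of null rows) / (total number of rows).
// Hypothetical helper; assumes at least one column and column names without dots.
def nullFractions(df: DataFrame): DataFrame = {
  val exprs = df.columns.map { c =>
    (sum(when(col(c).isNull, 1).otherwise(0)) / count(lit(1))).as(c)
  }
  df.agg(exprs.head, exprs.tail: _*)
}

val fractions = nullFractions(df)
fractions.show()
```

On an empty data frame the division has no rows to count, so the result should be treated as undefined in that case.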

answered Oct 16 '22 21:10

Raphael Roth

Here's how I did it in Scala 2.11, Spark 2.3.1:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

df.agg(
    count("x").divide(count(lit(1)))
        .as("x: percent non-null")
    // ...copy paste that for columns y and z
).head()

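count("x") counts only the rows where x is non-null, while count(lit(1)) counts every row (as does count("*")), so their ratio is the fraction of non-null values in x.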
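
A quick way to check what each form of count returns is to compute them side by side. This is a minimal sketch on made-up data; the local SparkSession setup and the alias names are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("count-variants").getOrCreate()
import spark.implicits._

// Illustrative data: three rows, x is non-null in only one of them.
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)),
  (None, Some(3), Some(4)),
  (None, None, Some(5))
).toDF("x", "y", "z")

val totals = df.agg(
  count("x").as("count_x"),        // rows where x is non-null
  count(lit(1)).as("count_one"),   // every row, since the literal 1 is never null
  count("*").as("count_star")      // every row
)
totals.show()
```

For this data, count_x is 1 while count_one and count_star are both 3.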

If you instead want the fraction of null values, take the complement of the count-based expression (the aliases say "percent", but the values are fractions between 0 and 1):

lit(1).minus(
    count("x").divide(count(lit(1)))
    )
    .as("x: percent null")

It's also worth knowing that you can cast nullness to an integer, then sum it.
But it's probably less performant:

// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
    .divide(count(lit(1)))
    .as("x: percent null")

answered Oct 16 '22 21:10

Birchlabs
