I have a data frame with some columns, and before doing analysis, I'd like to understand how complete the data frame is. So I want to filter the data frame and count for each column the number of non-null values, possibly returning a dataframe back.
Basically, I am trying to achieve the same result as expressed in this question but using Scala instead of Python.
Say you have:
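import spark.implicits._ // assumes a SparkSession named `spark` is in scope
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)), (None, Some(3), Some(4)), (None, None, Some(5))
).toDF("x", "y", "z")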
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?
In PySpark, you can count the null, None, NaN, and empty/blank values in a column by combining isNull() from the Column class with the SQL functions isnan(), count(), and when().
One straightforward option is to use the .describe() function to get a summary of your data frame, where the count row includes a count of non-null values:
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
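If you want exactly one row with the same columns as the input (no extra summary column, and numeric rather than string counts), a minimal sketch is to build one count expression per column. The object name NonNullCounts, the helper nonNullCounts, and the local SparkSession setup are my own assumptions for illustration, not part of the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}

object NonNullCounts {
  // Hypothetical helper: one-row DataFrame holding the non-null count of every column.
  // count(column) skips nulls, so each output cell is the number of non-null values.
  // Note: col(c) needs backtick-quoting if a column name contains a dot.
  def nonNullCounts(df: DataFrame): DataFrame =
    df.select(df.columns.map(c => count(col(c)).as(c)): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying this out.
    val spark = SparkSession.builder().master("local[*]").appName("non-null-counts").getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[Int], Option[Int], Option[Int])](
      (Some(0), Some(4), Some(3)),
      (None, Some(3), Some(4)),
      (None, None, Some(5))
    ).toDF("x", "y", "z")

    nonNullCounts(df).show()
    spark.stop()
  }
}
```

Because the expressions are generated from df.columns, this should keep working when the schema changes, whereas describe() also computes mean, stddev, min and max that you then throw away.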
Although I like Psidom's answer, I'm often more interested in the fraction of null values, because the number of non-null values alone doesn't tell you much.
You can do something like:
import org.apache.spark.sql.functions.{sum,when, count}
df.agg(
(sum(when($"x".isNotNull,0).otherwise(1))/count("*")).as("x : fraction null"),
(sum(when($"y".isNotNull,0).otherwise(1))/count("*")).as("y : fraction null"),
(sum(when($"z".isNotNull,0).otherwise(1))/count("*")).as("z : fraction null")
).show()
EDIT: count($"x") only counts non-null values, so the null count sum(when($"x".isNotNull,0).otherwise(1)) can also be written as count("*") - count($"x"). As I find this not obvious, I tend to use the sum notation, which is clearer.
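To avoid copy-pasting one expression per column, the same aggregation can be generated from df.columns. This is only a sketch: the object name NullFractions, the helper nullFractions, and the local SparkSession are assumptions I made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum, when}

object NullFractions {
  // Hypothetical helper: one-row DataFrame with the fraction of nulls per column.
  // sum(when(isNull, 1).otherwise(0)) counts the nulls; count(lit(1)) counts all rows.
  def nullFractions(df: DataFrame): DataFrame =
    df.select(df.columns.map { c =>
      (sum(when(col(c).isNull, 1).otherwise(0)) / count(lit(1))).as(c)
    }: _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying this out.
    val spark = SparkSession.builder().master("local[*]").appName("null-fractions").getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[Int], Option[Int], Option[Int])](
      (Some(0), Some(4), Some(3)),
      (None, Some(3), Some(4)),
      (None, None, Some(5))
    ).toDF("x", "y", "z")

    nullFractions(df).show()
    spark.stop()
  }
}
```

For the sample data this should give roughly 0.67 for x, 0.33 for y and 0.0 for z. I kept the plain column names as aliases so the result has the same columns as the input; add a suffix if you prefer labels like "x : fraction null".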
Here's how I did it in Scala 2.11, Spark 2.3.1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.agg(
count("x").divide(count(lit(1)))
.as("x: percent non-null")
// ...copy paste that for columns y and z
).head()
count("x") counts only the rows where x is non-null, whereas count(lit(1)), like count("*"), counts every row.
If you instead want the fraction of nulls, take the complement of the count-based expression:
lit(1).minus(
count("x").divide(count(lit(1)))
)
.as("x: percent null")
It's also worth knowing that you can cast nullness to an integer, then sum it.
But it's probably less performant:
// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
.divide(count(lit(1)))
.as("x: percent null")
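As a sanity check, here is a small sketch that computes the null fraction of a single column in all three ways discussed in this thread (when/sum, complement of count, and cast/sum) in one aggregation. The object name NullFractionVariants, the helper variants, the output aliases, and the local SparkSession are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum, when}
import org.apache.spark.sql.types.IntegerType

object NullFractionVariants {
  // Hypothetical helper: null fraction of column `c`, computed three equivalent ways.
  def variants(df: DataFrame, c: String): Row = {
    val total = count(lit(1)) // counts every row, null or not
    df.agg(
      // 1) explicit when/otherwise, summed
      (sum(when(col(c).isNull, 1).otherwise(0)) / total).as("when_sum"),
      // 2) complement of the non-null fraction; count(col) skips nulls
      (lit(1) - count(col(c)) / total).as("complement"),
      // 3) cast the isNull boolean to 0/1, then sum
      (sum(col(c).isNull.cast(IntegerType)) / total).as("cast_sum")
    ).head()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying this out.
    val spark = SparkSession.builder().master("local[*]").appName("null-variants").getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[Int], Option[Int], Option[Int])](
      (Some(0), Some(4), Some(3)),
      (None, Some(3), Some(4)),
      (None, None, Some(5))
    ).toDF("x", "y", "z")

    println(variants(df, "x"))
    spark.stop()
  }
}
```

The three values should agree up to floating-point rounding (the complement form subtracts from 1, so the last digit may differ), which is why I would compare them with a tolerance rather than with ==.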