Difference between === null and isNull in Spark DataFrame

I am a bit confused about the difference between

df.filter(col("c1") === null) and df.filter(col("c1").isNull)

On the same DataFrame I get counts with === null but zero counts with isNull. Please help me understand the difference. Thanks.

asked Jan 08 '17 by John

2 Answers

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.
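For example, a minimal sketch of modelling the nullable field with Option instead (the case class and data here are illustrative, not from the question); Spark's encoders turn None into SQL NULL:

import org.apache.spark.sql.SparkSession

object OptionExample extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Option[String] instead of a nullable String: None is encoded as SQL NULL
  case class Record(c1: Option[String], c2: String)

  val df = Seq(Record(Some("a"), "b"), Record(None, "c")).toDF()

  df.filter($"c1".isNull).show() // the None row surfaces as NULL
}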

Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself:

spark.sql("SELECT NULL = NULL").show 
+-------------+ |(NULL = NULL)| +-------------+ |         null| +-------------+ 
spark.sql("SELECT NULL != NULL").show 
+-------------------+ |(NOT (NULL = NULL))| +-------------------+ |               null| +-------------------+ 
spark.sql("SELECT TRUE != NULL").show 
+------------------------------------+ |(NOT (true = CAST(NULL AS BOOLEAN)))| +------------------------------------+ |                                null| +------------------------------------+ 
spark.sql("SELECT TRUE = NULL").show 
+------------------------------+ |(true = CAST(NULL AS BOOLEAN))| +------------------------------+ |                          null| +------------------------------+ 

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show 
    +--------------+ |(NULL IS NULL)| +--------------+ |          true| +--------------+ 
    spark.sql("SELECT TRUE IS NULL").show 
    +--------------+ |(true IS NULL)| +--------------+ |         false| +--------------+ 
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show 
    +------------------+ |(NULL IS NOT NULL)| +------------------+ |             false| +------------------+ 
    spark.sql("SELECT TRUE IS NOT NULL").show 
    +------------------+ |(true IS NOT NULL)| +------------------+ |              true| +------------------+ 

These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull respectively.
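In the DSL that looks like the following minimal sketch (the two-column toy DataFrame is an assumption for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NullCheckExample extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", "b"), (null, "c")).toDF("c1", "c2")

  df.filter(col("c1").isNull).show()    // keeps only the (null, c) row
  df.filter(col("c1").isNotNull).show() // keeps only the (a, b) row
}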

Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show 
+---------------+ |(NULL <=> NULL)| +---------------+ |           true| +---------------+ 
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show 
+--------------------------------+ |(CAST(NULL AS BOOLEAN) <=> true)| +--------------------------------+ |                           false| +--------------------------------+ 

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show 
+---------------+ |(col1 <=> col2)| +---------------+ |           true| +---------------+ 
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show 
+---------------+ |(col1 <=> col2)| +---------------+ |          false| +---------------+ 

in SQL and DataFrame DSL respectively.
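The DSL operator <=> is an alias for Column.eqNullSafe; a short sketch of using it in a filter (same toy data assumption as above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, not}

object NullSafeExample extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", "b"), (null, "c")).toDF("c1", "c2")

  // NULL <=> NULL is true (not NULL), so this actually matches the null row
  df.filter(col("c1") <=> lit(null)).show()

  // NULL-safe inequality: rows where c1 and c2 differ, nulls included
  df.filter(not(col("c1") <=> col("c2"))).show()
}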

Related:

Including null values in an Apache Spark Join

answered Oct 06 '22 by zero323

Usually the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()

  case class Record(c1: String, c2: String)

  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)

  import session.implicits._

  val df: DataFrame = rdd.toDF

  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}

The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]

This filter clause looks nonsensical: and-ing with null ensures the predicate can never resolve to true, so the filter keeps no rows.
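That behaviour follows from SQL's three-valued logic, which is easy to confirm directly (a quick sanity check, not part of the original answer); a filter keeps only rows whose predicate is exactly true:

spark.sql("SELECT TRUE AND NULL").show // shows null, not false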

The second explain plan looks like:

== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]

== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]

Here the filter is what we expect and want.

answered Oct 06 '22 by mattinbits