Why does filter remove null values by default on a Spark DataFrame?

filter on basic Scala collections containing null values has the following (and quite intuitive) behaviour:

scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)

However, I was very surprised to discover that the following filter removes nulls from a Spark DataFrame:

scala> val df = List(("a", null), ( "c", "d")).toDF("A", "B")
scala> df.show
+---+----+
|  A|   B|
+---+----+
|  a|null|
|  c|   d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
|  A|  B|
+---+---+
+---+---+

If I want to keep null values, I have to add the null check explicitly:

df.filter('B =!= "d" || 'B.isNull).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+
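
An equivalent way to express this (a sketch I have not run here, assuming the standard coalesce and lit functions from org.apache.spark.sql.functions) is to coalesce the null comparison result to true, which should give the same result as above:

scala> import org.apache.spark.sql.functions.{coalesce, lit}
scala> // 'B =!= "d" evaluates to null when B is null;
scala> // coalesce maps that null to true, so the row is kept
scala> df.filter(coalesce('B =!= "d", lit(true))).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+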

Personally, I think that removing nulls by default is very error-prone. Why this choice? And why is it not explicitly stated in the API documentation? Am I missing something?

asked Mar 05 '18 by greg hor
1 Answer

This is because standard SQL comparisons are not null-safe: comparing anything with NULL yields NULL (unknown) rather than true or false. Spark SQL follows these SQL semantics (Scala collections do not).
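
You can check this directly in Spark SQL (a quick sketch; the eq and neg aliases are mine, and the exact show output may vary slightly between Spark versions):

scala> // both the comparison and its negation are null, not false
scala> spark.sql("SELECT null = 'd' AS eq, NOT (null = 'd') AS neg").show
+----+----+
|  eq| neg|
+----+----+
|null|null|
+----+----+

Even the negated comparison is null, which is why 'B =!= "d" drops the rows where B is null.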

Spark DataFrames do have a null-safe equality operator, <=>, though:

scala> df.filter($"B" <=> null).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+


scala> df.filter(not($"B" <=> "d")).show
+---+----+
|  A|   B|
+---+----+
|  a|null|
+---+----+

Note (added in edit): the point of not being null-safe by default is to allow null as the result of a comparison. Is a missing value equal to "c"? We don't know. Is a missing value equal to another missing value? We don't know either. Such a comparison evaluates to null rather than true or false, and in a filter, null is treated as false.
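
To make this concrete, you can inspect the comparison column itself instead of filtering on it (a small sketch; the eq_d alias is mine):

scala> // the comparison against a null B yields null, not false
scala> df.select('B, ('B === "d").as("eq_d")).show
+----+----+
|   B|eq_d|
+----+----+
|null|null|
|   d|true|
+----+----+

filter keeps only the rows where the predicate evaluates to true, so both the false rows and the null rows are dropped.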

answered Oct 20 '22 by Wilmerton