filter on basic Scala collections containing null values has the following (and quite intuitive) behaviour:
scala> List("a", "b", null).filter(_ != "a")
res0: List[String] = List(b, null)
However, I was very surprised to find that the following filter removes nulls from a Spark DataFrame:
scala> val df = List(("a", null), ("c", "d")).toDF("A", "B")
scala> df.show
+---+----+
| A| B|
+---+----+
| a|null|
| c| d|
+---+----+
scala> df.filter('B =!= "d").show
+---+---+
| A| B|
+---+---+
+---+---+
If I want to keep null values, I have to add an explicit check:
scala> df.filter('B =!= "d" || 'B.isNull).show
+---+----+
| A| B|
+---+----+
| a|null|
+---+----+
Personally, I think that removing nulls by default is very error prone. Why this choice? And why is it not explicitly stated in the API documentation? Am I missing something?
This is because standard SQL comparisons are not null-safe, and Spark SQL follows the SQL standard here (unlike Scala collections).
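As a quick sanity check (just a sketch, assuming a spark-shell session where spark is the usual SparkSession), the same three-valued logic can be seen by running the comparison through Spark SQL directly; the comparison with NULL evaluates to null rather than false:
scala> spark.sql("SELECT NULL <> 'd' AS neq, NULL = 'd' AS eq").show
+----+----+
| neq|  eq|
+----+----+
|null|null|
+----+----+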
Spark DataFrames do have a null-safe equality operator, though:
scala> df.filter($"B" <=> null).show
+---+----+
| A| B|
+---+----+
| a|null|
+---+----+
scala> df.filter(not($"B" <=> "d")).show
+---+----+
| A| B|
+---+----+
| a|null|
+---+----+
Note in edit: the point of not being null-safe by default is to allow null as a result of a test. Is a missing value equal to "c"? We don't know. Is a missing value equal to another missing value? We don't know either. But in a filter, null is treated as false.
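To make that concrete (a small sketch reusing the df defined in the question), you can select the test itself as a column and see that it is three-valued; filter then keeps only the rows where the predicate is true:
scala> df.select($"B", ($"B" =!= "d").as("B_neq_d")).show
+----+-------+
|   B|B_neq_d|
+----+-------+
|null|   null|
|   d|  false|
+----+-------+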