In the example below the df.a == 1 predicate returns the correct result, but df.a == None returns a count of 0 when it should return 1.
l = [[1], [1], [2], [2], [None]]
df = sc.parallelize(l).toDF(['a'])
df # DataFrame[a: bigint]
df.collect() # [Row(a=1), Row(a=1), Row(a=2), Row(a=2), Row(a=None)]
df.where(df.a == 1).count() # 2L
df.where(df.a == None).count() # 0L
I'm using Spark 1.3.1.
You can use the Column.isNull method:
df.where(df.a.isNull()).count() # 1L
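There is also a complementary Column.isNotNull for the opposite test; on the sample data above:
df.where(df.a.isNotNull()).count() # 4L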
On a side note, this behavior is exactly what one would expect from a normal SQL query. Since NULL marks "missing information and inapplicable information" [1], it doesn't make sense to ask whether something is equal to NULL. A value simply either IS or IS NOT missing.
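To make the three-valued logic visible, you can evaluate the predicate instead of filtering on it: comparing anything with NULL yields NULL rather than False, and WHERE keeps only rows where the predicate is true. A quick check on the DataFrame above (the eq_none alias is just for display):
df.select((df.a == None).alias('eq_none')).collect()
# [Row(eq_none=None), Row(eq_none=None), Row(eq_none=None), Row(eq_none=None), Row(eq_none=None)]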
The Scala API provides a special null-safe equality operator, <=>, so it is possible to do something like this:
df.where($"a" <=> lit(null))
but it doesn't look like a good idea if you ask me.
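For completeness, the Python API later gained the same null-safe equality as Column.eqNullSafe, but only in Spark 2.3, so it is not available on 1.3.1. A minimal sketch assuming Spark 2.3+:
df.where(df.a.eqNullSafe(None)).count() # 1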
[1] Wikipedia, Null (SQL)