In the example below, the df.a == 1 predicate returns the correct result, but df.a == None returns 0 when it should return 1.
l = [[1], [1], [2], [2], [None]]
df = sc.parallelize(l).toDF(['a'])
df # DataFrame[a: bigint]
df.collect() # [Row(a=1), Row(a=1), Row(a=2), Row(a=2), Row(a=None)]
df.where(df.a == 1).count() # 2L
df.where(df.a == None).count() # 0L
Using Spark 1.3.1.
You can use the Column.isNull method:
df.where(df.a.isNull()).count() # 1L
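For the complementary filter there is Column.isNotNull, which should work the same way against the DataFrame above (isNotNull is not shown in the original answer, but it is part of the same Column API):
df.where(df.a.isNotNull()).count() # 4L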
On a side note, this behavior is what one would expect from a normal SQL query. Since NULL marks "missing information and inapplicable information" [1], it doesn't make sense to ask whether something is equal to NULL. It simply either IS or IS NOT missing.
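You can see this directly: df.a == None builds a column expression that evaluates to NULL for every row, and where() only keeps rows whose predicate is true. A minimal sketch (eq_null is just an illustrative alias):
df.select((df.a == None).alias("eq_null")).collect()
# [Row(eq_null=None), Row(eq_null=None), Row(eq_null=None), Row(eq_null=None), Row(eq_null=None)]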
The Scala API provides a special null-safe equality operator, <=>, so it is possible to do something like this:
df.where($"a" <=> lit(null))
but it doesn't look like a good idea if you ask me.
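For reference, later PySpark releases (2.3+, so not the 1.3.1 used here) expose the same null-safe equality on the Python side as Column.eqNullSafe; a sketch of the equivalent call, with the same caveat:
df.where(df.a.eqNullSafe(None)).count() # 1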
[1] Wikipedia, Null (SQL), https://en.wikipedia.org/wiki/Null_(SQL)