
How to get rows from a DataFrame that contain the value None in PySpark (Spark)

In the example below, the predicate df.a == 1 returns the correct result, but df.a == None returns 0 when it should return 1.

l = [[1], [1], [2], [2], [None]]
df = sc.parallelize(l).toDF(['a'])
df    # DataFrame[a: bigint]
df.collect()    # [Row(a=1), Row(a=1), Row(a=2), Row(a=2), Row(a=None)]
df.where(df.a == 1).count()   # 2L
df.where(df.a == None).count()   # 0L

Using Spark 1.3.1

asked Aug 26 '15 by marcin_koss


1 Answer

You can use the Column.isNull method:

df.where(df.a.isNull()).count()
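
As a quick sanity check (just a sketch reusing the df built in the question), isNull and its counterpart isNotNull should give complementary counts:

df.where(df.a.isNull()).count()      # 1 -> the single None row
df.where(df.a.isNotNull()).count()   # 4 -> all non-null rows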

On a side note, this behavior is what one would expect from a normal SQL query. Since NULL marks "missing information and inapplicable information" [1], it doesn't make sense to ask whether something is equal to NULL; it simply either IS or IS NOT missing.
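
For illustration, the same three-valued logic shows up if you run the query through Spark SQL (a sketch assuming a Spark 1.x sqlContext, as in the shells of that era; the table name t is arbitrary):

df.registerTempTable("t")
sqlContext.sql("SELECT COUNT(*) AS c FROM t WHERE a = NULL").collect()    # [Row(c=0)] -- a = NULL evaluates to UNKNOWN
sqlContext.sql("SELECT COUNT(*) AS c FROM t WHERE a IS NULL").collect()   # [Row(c=1)]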

The Scala API provides a special null-safe equality operator, <=>, so it is possible to do something like this:

df.where($"a" <=> lit(null))

but it doesn't look like a good idea if you ask me.
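
For what it's worth, later PySpark releases (2.3 and up, so not 1.3.1) expose the same null-safe equality as Column.eqNullSafe; on those versions the equivalent would look roughly like this:

from pyspark.sql.functions import lit
df.where(df.a.eqNullSafe(lit(None))).count()   # 1 -- requires Spark 2.3+, not available in 1.3.1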

[1] Wikipedia, Null (SQL)

answered by zero323