In the example below the df.a == 1 predicate returns the correct result, but df.a == None returns a count of 0 when it should return 1.
l = [[1], [1], [2], [2], [None]]
df = sc.parallelize(l).toDF(['a'])
df # DataFrame[a: bigint]
df.collect() # [Row(a=1), Row(a=1), Row(a=2), Row(a=2), Row(a=None)]
df.where(df.a == 1).count() # 2L
df.where(df.a == None).count() # 0L
I'm using Spark 1.3.1.
You can use the Column.isNull method:
df.where(df.a.isNull()).count() # 1L
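There is also a complementary Column.isNotNull for the opposite test; on the sample data above:
df.where(df.a.isNotNull()).count() # 4L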
On a side note, this behavior is exactly what one would expect from a normal SQL query. Since NULL marks "missing information and inapplicable information" [1], it doesn't make sense to ask whether something is equal to NULL. A value simply either IS or IS NOT missing.
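To make the three-valued logic visible, you can evaluate the predicate instead of filtering on it: comparing anything with NULL yields NULL rather than False, and WHERE keeps only rows where the predicate is true. A quick check on the DataFrame above (the eq_none alias is just for display):
df.select((df.a == None).alias('eq_none')).collect()
# [Row(eq_none=None), Row(eq_none=None), Row(eq_none=None), Row(eq_none=None), Row(eq_none=None)]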
The Scala API provides a special null-safe equality operator, <=>, so it is possible to do something like this:
df.where($"a" <=> lit(null))
but it doesn't look like a good idea if you ask me.
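For completeness, the Python API later gained the same null-safe equality as Column.eqNullSafe, but only in Spark 2.3, so it is not available on 1.3.1. A minimal sketch assuming Spark 2.3+:
df.where(df.a.eqNullSafe(None)).count() # 1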
[1] Wikipedia, Null (SQL)