null value and countDistinct with spark dataframe

I have a very simple dataframe

  df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])

  +----+---+---+
  |   a|  b|  c|
  +----+---+---+
  |null|  1|  3|
  |   2|  1|  3|
  |   2|  1|  3|
  +----+---+---+

When I apply countDistinct on this dataframe, I get different results depending on the method:

First method

  df.distinct().count()

2

It's the result I expect: the last two rows are identical, but the first one is distinct from the other two because of the null value.

Second method

  import pyspark.sql.functions as F
  df.agg(F.countDistinct("a","b","c")).show()

1

The way F.countDistinct deals with null values is not intuitive to me.

Does this look like a bug or normal behavior to you? And if it is normal, how can I write something that outputs exactly the result of the first approach, but in the same spirit as the second method?

asked Oct 31 '16 by Stéphane Soulier

1 Answer

countDistinct works the same way as Hive count(DISTINCT expr[, expr]):

count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

The first row, which contains a null in column a, is therefore not counted. This is standard behavior for SQL aggregate functions.
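
You can check that countDistinct follows these SQL semantics by running the equivalent query directly. This is a minimal check against the dataframe from the question; the view name t is arbitrary:

  df.createOrReplaceTempView("t")
  # count(DISTINCT ...) skips every row in which at least one of the listed
  # expressions is NULL, so only the two (2, 1, 3) rows are considered
  spark.sql("SELECT count(DISTINCT a, b, c) AS cnt FROM t").show()

  +---+
  |cnt|
  +---+
  |  1|
  +---+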

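To get the result of the first approach while keeping the aggregate style of the second one, a commonly used workaround is to wrap the columns in a struct before counting. This is a sketch rather than part of the original answer, and it assumes a Spark version whose countDistinct accepts a struct column:

  import pyspark.sql.functions as F

  # a struct is non-null even when some of its fields are null, so the
  # (null, 1, 3) row survives the implicit NULL filter and is counted
  # as a distinct value of its own
  df.agg(F.countDistinct(F.struct("a", "b", "c")).alias("cnt")).show()

  +---+
  |cnt|
  +---+
  |  2|
  +---+

This matches df.distinct().count(), because the NULL check is now applied to the struct as a whole rather than to each individual column.
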
answered Nov 11 '22 by user6022341