null value and countDistinct with spark dataframe

I have a very simple dataframe

  df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])

  +----+---+---+
  |   a|  b|  c|
  +----+---+---+
  |null|  1|  3|
  |   2|  1|  3|
  |   2|  1|  3|
  +----+---+---+

When I apply countDistinct on this dataframe, I get different results depending on the method:

First method

  df.distinct().count()

2

It's the result I expect: the last two rows are identical, but the first one is distinct from the other two because of the null value.

Second method

  import pyspark.sql.functions as F
  df.agg(F.countDistinct("a","b","c")).show()

1

The way F.countDistinct deals with null values is not intuitive to me.

Does this look like a bug or normal behavior to you? And if it is normal, how can I write something that outputs exactly the result of the first approach, but in the same spirit as the second method?

asked Oct 31 '16 by Stéphane Soulier

1 Answer

countDistinct works the same way as Hive count(DISTINCT expr[, expr]):

count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

The first row, which contains a null in column a, is therefore not counted. This is standard behavior for SQL aggregate functions.
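
You can check that countDistinct follows these SQL semantics by running the equivalent query directly. This is a minimal check against the dataframe from the question; the view name t is arbitrary:

  df.createOrReplaceTempView("t")
  # count(DISTINCT ...) skips every row in which at least one of the listed
  # expressions is NULL, so only the two (2, 1, 3) rows are considered
  spark.sql("SELECT count(DISTINCT a, b, c) AS cnt FROM t").show()

  +---+
  |cnt|
  +---+
  |  1|
  +---+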

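To get the result of the first approach while keeping the aggregate style of the second one, a commonly used workaround is to wrap the columns in a struct before counting. This is a sketch rather than part of the original answer, and it assumes a Spark version whose countDistinct accepts a struct column:

  import pyspark.sql.functions as F

  # a struct is non-null even when some of its fields are null, so the
  # (null, 1, 3) row survives the implicit NULL filter and is counted
  # as a distinct value of its own
  df.agg(F.countDistinct(F.struct("a", "b", "c")).alias("cnt")).show()

  +---+
  |cnt|
  +---+
  |  2|
  +---+

This matches df.distinct().count(), because the NULL check is now applied to the struct as a whole rather than to each individual column.
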
answered Nov 11 '22 by user6022341