I have a very simple dataframe
df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])
+----+---+---+
| a| b| c|
+----+---+---+
|null| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
+----+---+---+
When I apply countDistinct to this dataframe, I get different results depending on the method:
df.distinct().count()
2
This is the result I expect: the last two rows are identical, but the first one is distinct from the other two (because of the null value).
import pyspark.sql.functions as F
df.agg(F.countDistinct("a","b","c")).show()
1
The way F.countDistinct deals with null values is not intuitive to me.
Does this look like a bug or normal behavior to you? And if it is normal, how can I write something that produces exactly the result of the first approach, but in the same spirit as the second method?
countDistinct works the same way as Hive count(DISTINCT expr[, expr]):
count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
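You can see the same semantics directly by running the equivalent SQL aggregation (a minimal sketch; the temp view name df_view is an arbitrary choice):
df.createOrReplaceTempView("df_view")
# count(DISTINCT ...) skips every row where any of the listed expressions is NULL
spark.sql("SELECT count(DISTINCT a, b, c) FROM df_view").show()
1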
The first row, where a is null, is therefore not counted. This null-skipping behavior is standard for SQL aggregate functions.
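If you want the result of the first approach but in the same aggregate style as the second, one common workaround (a sketch, not the only option) is to wrap the columns in a struct: a struct value is never null itself, even when some of its fields are, so the row containing the null is kept and counted as its own distinct value:
import pyspark.sql.functions as F
# struct(a, b, c) is non-null even when a is null, so no row is skipped
df.agg(F.countDistinct(F.struct("a", "b", "c"))).show()
2
An alternative is to coalesce each column to a sentinel value before counting, but that risks colliding with real data; the struct wrapper avoids that problem.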