 

PySpark DataFrame cube method returning duplicate null values

I have the following data (you can reproduce it by copying and pasting):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in the pyspark shell
l = [Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=True), Row(value=None), Row(value=None), Row(value=None), Row(value=True), Row(value=None), Row(value=True), Row(value=None)]
l_df = spark.createDataFrame(l)

Let's take a look at the schema of l_df:

l_df.printSchema()

root
 |-- value: boolean (nullable = true)

Now I want to use cube() to count the frequency of each distinct value in the value column:

l_df.cube("value").count().show()

But I see two types of null values!

+-----+-----+
|value|count|
+-----+-----+
| true|   67|
| null|  100|
| null|   33|
+-----+-----+

To verify that I don't actually have two kinds of null, I check the distinct values:

l_df.select("value").distinct().collect()

And there is indeed only one type of null:

[Row(value=None), Row(value=True)]

Just to double check:

l_df.select("value").distinct().count()

And it returns 2.

I also noticed that len(l) is 100, and the count next to the first null equals this number. Why is this happening?

System info: Spark 2.1.0, Python 2.7.8, [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2

asked Oct 25 '25 by versatile parsley


1 Answer

These are not two types of null but the results of aggregations at different levels. As explained in What is the difference between cube, rollup and groupBy operators?, your cube application is equivalent to:

SELECT NULL AS value, COUNT(*) FROM df
UNION ALL
SELECT value, COUNT(*) FROM df GROUP BY value

The first query generates the tuple (null, 100) (the total number of records), where NULL is just a placeholder, and the second query generates the tuples (true, 67) and (null, 33), where NULL is one of the levels of the value column.
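
The same union can be written directly in PySpark. A minimal sketch (the alias names total, per_value, and count are mine) that reproduces the cube output by hand:

from pyspark.sql.functions import count, lit

# Grand total, with a NULL placeholder in the value column
total = (l_df.agg(count("*").alias("count"))
             .select(lit(None).cast("boolean").alias("value"), "count"))

# Per-value counts, where null is a real level of the column
per_value = l_df.groupBy("value").agg(count("*").alias("count"))

total.union(per_value).show()
# +-----+-----+
# |value|count|
# +-----+-----+
# | null|  100|
# | true|   67|
# | null|   33|
# +-----+-----+
# (row order may differ)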

It is easy to check with grouping (or grouping_id):

from pyspark.sql.functions import grouping, count

l_df.cube("value").agg(count("*"), grouping("value")).show()
# +-----+--------+---------------+
# |value|count(1)|grouping(value)|
# +-----+--------+---------------+
# | true|      67|              0|
# | null|     100|              1|
# | null|      33|              0|
# +-----+--------+---------------+
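
If you only want the real per-value counts, one option (again just a sketch; the g alias is mine) is to filter on the grouping bit and drop it afterwards:

from pyspark.sql.functions import grouping, count

(l_df.cube("value")
     .agg(count("*").alias("count"), grouping("value").alias("g"))
     .where("g = 0")   # keep only rows where value was actually grouped
     .drop("g")
     .show())
# +-----+-----+
# |value|count|
# +-----+-----+
# | true|   67|
# | null|   33|
# +-----+-----+
# (row order may differ)

For a single column this is equivalent to a plain l_df.groupBy("value").count(), so cube only pays off when you also want the higher-level totals.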
answered Oct 27 '25 by Alper t. Turker


