 

pyspark Window.partitionBy vs groupBy

Let's say I have a dataset with around 2.1 billion records.

It's a dataset with customer information, and I want to know how many times they did something. So I should group on the ID and sum one column (it has 0 and 1 values, where 1 indicates an action).

Now, I can use a simple groupBy and agg(sum) on it, but to my understanding this is not really efficient: the groupBy will shuffle a lot of data between partitions.
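
To make the comparison concrete, this is roughly what I mean by the groupBy approach (a minimal sketch; the DataFrame `df` and the column names `id` and `action` are placeholders of mine, not from my actual dataset):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real 2.1 billion-record dataset:
# one "id" column and one 0/1 "action" column
df = spark.createDataFrame(
    [(1, 0), (1, 1), (2, 1), (2, 1)],
    ["id", "action"],
)

# One row per id, with the number of actions (sum of the 0/1 column)
counts = df.groupBy("id").agg(F.sum("action").alias("action_count"))
counts.show()
```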

Alternatively, I can use a Window function with a partitionBy clause and then sum the data. One disadvantage is that I'll then have to apply an extra filter, because it keeps all the rows, and I want one record per ID.
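
For comparison, the Window variant would look something like this (again a sketch, reusing the same assumed `df` and column names; note the extra deduplication step needed to get back to one row per ID):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("id")

# Every row keeps its own copy of the per-id sum...
with_counts = df.withColumn("action_count", F.sum("action").over(w))

# ...so an extra step is needed to reduce to one record per id
one_per_id = with_counts.select("id", "action_count").dropDuplicates(["id"])
one_per_id.show()
```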

But I don't see how this Window handles the data. Is it better than the groupBy and sum, or is it the same?

asked Nov 08 '17 by Anton Mulder

1 Answer

As far as I know, when working with Spark DataFrames, the groupBy operation is optimized via Catalyst. The groupBy on DataFrames is unlike the groupBy on RDDs.

For instance, the groupBy on DataFrames performs the aggregation within each partition first, and then shuffles only the aggregated results for the final aggregation stage. Hence, only the reduced, aggregated results get shuffled, not the entire dataset. This is similar to reduceByKey or aggregateByKey on RDDs. See this related SO article with a nice example.
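
You can see this two-stage (partial, then final) aggregation in the physical plan yourself. A sketch, using the same hypothetical `id`/`action` column names; the exact plan text varies across Spark versions, but the `partial_sum` appearing before the shuffle (`Exchange`) is the map-side pre-aggregation:

```python
df.groupBy("id").agg(F.sum("action")).explain()

# The physical plan typically looks roughly like:
#   HashAggregate(keys=[id], functions=[sum(action)])              <- final aggregation
#   +- Exchange hashpartitioning(id, ...)                          <- shuffles only partial results
#      +- HashAggregate(keys=[id], functions=[partial_sum(action)]) <- per-partition partial sums
```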

In addition, see slide 5 in this presentation by Yin Huai, which covers the benefits of using DataFrames in conjunction with Catalyst.

In conclusion, I think you're fine using groupBy with Spark DataFrames. A Window function does not seem appropriate for your requirement.

answered Oct 09 '22 by pansen