Let's say I have a dataset with around 2.1 billion records.
It's a dataset with customer information, and I want to know how many times each customer did something. So I should group on the ID and sum one column (it has 0 and 1 values, where 1 indicates an action).
Now, I can use a simple groupBy and agg(sum) on it, but to my understanding this is not really efficient: the groupBy will move around a lot of data between partitions.
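A minimal sketch of that approach in Scala, assuming a hypothetical DataFrame df with an ID column customer_id and a 0/1 column action (both names are placeholders, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("actions-per-customer").getOrCreate()
import spark.implicits._

// Placeholder data standing in for the 2.1 billion-record dataset.
val df = Seq(("a", 1), ("a", 0), ("b", 1)).toDF("customer_id", "action")

// One row per ID with the total number of actions.
val actionCounts = df
  .groupBy("customer_id")
  .agg(sum("action").as("action_count"))

actionCounts.show()
```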
Alternatively, I can also use a Window function with a partitionBy clause and then sum the data. One disadvantage is that I'd then have to apply an extra filter, because the window keeps all the rows and I want one record per ID.
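For comparison, here is a sketch of that Window variant, reusing the placeholder df from the snippet above; the row_number helper column rn is only there to implement the extra filter:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, sum}

// Sum over all rows of each ID; every input row keeps the total.
val byId = Window.partitionBy("customer_id")
// An (arbitrary) ordering is required for row_number.
val ordered = Window.partitionBy("customer_id").orderBy("action")

val viaWindow = df
  .withColumn("action_count", sum("action").over(byId))
  .withColumn("rn", row_number().over(ordered))
  .filter(col("rn") === 1)               // the extra filter: keep one row per ID
  .select("customer_id", "action_count")
```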
But I don't see how this Window handles the data. Is it better than the groupBy and sum, or is it the same?
As far as I know, when working with Spark DataFrames, the groupBy operation is optimized via Catalyst. groupBy on DataFrames is unlike groupBy on RDDs.
For instance, groupBy on DataFrames performs the aggregation on each partition first, and then shuffles the aggregated results for the final aggregation stage. Hence, only the reduced, aggregated results get shuffled, not the entire data. This is similar to reduceByKey or aggregateByKey on RDDs. See this related SO article with a nice example.
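To see this partial aggregation for yourself, you can inspect the physical plan; with the placeholder df from above it looks roughly like the following (the exact plan text varies by Spark version):

```scala
df.groupBy("customer_id")
  .agg(sum("action"))
  .explain()

// Typical (abridged) plan shape, read bottom-up:
// HashAggregate(keys=[customer_id], functions=[sum(action)])         <- final merge
// +- Exchange hashpartitioning(customer_id, ...)                     <- shuffles partial sums only
//    +- HashAggregate(keys=[customer_id], functions=[partial_sum(action)])  <- per-partition pre-aggregation
```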
In addition, see slide 5 of this presentation by Yin Huai, which covers the benefits of using DataFrames in conjunction with Catalyst.
To conclude, I think you're fine using groupBy with Spark DataFrames. Using Window does not seem appropriate to me for your requirement.