There are various partitioning function in Flink's Dataset API, such as partitionByHash and partitionByRange.
I would like to understand what is partitioning at the first place and what is the difference between groupBy and partitioning.
Partitioning is a more low-level operation than groupBy and does not apply a function on the data. It rather defines how data is distributed across parallel task instances. Data can be partitioned with different methods such as hash partitioning or range partitioning.
groupBy is not an operation by itself. It always needs a function that is applied on the grouped DataSet such as reduce, groupReduce, or groupCombine. The groupBy API defines how records are grouped before they are given into the respective function. Grouping of records happens in two steps.
So, the first step of groupBy is partitioning.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With