There are various partitioning function in Flink's Dataset API, such as partitionByHash
and partitionByRange
.
I would like to understand what is partitioning at the first place and what is the difference between groupBy
and partitioning.
Partitioning is a more low-level operation than groupBy
and does not apply a function on the data. It rather defines how data is distributed across parallel task instances. Data can be partitioned with different methods such as hash partitioning or range partitioning.
groupBy
is not an operation by itself. It always needs a function that is applied on the grouped DataSet
such as reduce
, groupReduce
, or groupCombine
. The groupBy
API defines how records are grouped before they are given into the respective function. Grouping of records happens in two steps.
So, the first step of groupBy
is partitioning.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With