Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Flink: What is the difference of groupBy and partitioning in the DataSet API?

Tags:

apache-flink

There are various partitioning function in Flink's Dataset API, such as partitionByHash and partitionByRange.

I would like to understand what is partitioning at the first place and what is the difference between groupBy and partitioning.

like image 227
Ganesh P Avatar asked Mar 07 '23 15:03

Ganesh P


1 Answers

Partitioning is a more low-level operation than groupBy and does not apply a function on the data. It rather defines how data is distributed across parallel task instances. Data can be partitioned with different methods such as hash partitioning or range partitioning.

groupBy is not an operation by itself. It always needs a function that is applied on the grouped DataSet such as reduce, groupReduce, or groupCombine. The groupBy API defines how records are grouped before they are given into the respective function. Grouping of records happens in two steps.

  1. All records with the same grouping key must be moved to the same task instance. This is done by partitioning the data. Since there are usually more distinct grouping keys than task instances, a task instance must handle records with distinct grouping keys.
  2. All records in the same task instance must be grouped on the key. This is usually done by sorting the data.

So, the first step of groupBy is partitioning.

like image 116
Fabian Hueske Avatar answered May 01 '23 11:05

Fabian Hueske