Apache Flink: What is the difference between groupBy and partitioning in the DataSet API?

Question

There are various partitioning functions in Flink's DataSet API, such as partitionByHash and partitionByRange.

I would like to understand what partitioning is in the first place, and what the difference is between groupBy and partitioning.

Fabian Hueske · Accepted Answer

Partitioning is a lower-level operation than groupBy and does not apply a function to the data. Instead, it defines how data is distributed across parallel task instances. Data can be partitioned with different methods, such as hash partitioning or range partitioning.
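To make the idea of "distributing without applying a function" concrete, here is a minimal sketch in plain Java. It is not Flink code and not Flink's internal partitioner: the class PartitioningSketch, its method names, and the hand-picked range boundaries are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: plain Java, not Flink's actual partitioner.
// All names here are hypothetical and chosen for illustration.
class PartitioningSketch {

    // Hash partitioning: the hash of the key picks the target task instance.
    static int hashPartition(String key, int parallelism) {
        // floorMod keeps the result in [0, parallelism) even for negative hashes.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Range partitioning: sorted boundaries split the key space into
    // contiguous ranges. Keys <= boundaries[i] go to partition i, and
    // everything above the last boundary goes to the last partition.
    static int rangePartition(int key, int[] boundaries) {
        for (int i = 0; i < boundaries.length; i++) {
            if (key <= boundaries[i]) {
                return i;
            }
        }
        return boundaries.length;
    }

    // Distribute records over `parallelism` buckets. The records themselves
    // are not changed; only their placement is decided.
    static List<List<String>> distribute(List<String> records, int parallelism) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String record : records) {
            partitions.get(hashPartition(record, parallelism)).add(record);
        }
        return partitions;
    }
}
```

Both methods only decide where a record goes. In the DataSet API, that placement is what partitionByHash and partitionByRange ask for; the sketch above merely imitates the idea with fixed boundaries and Java's hashCode.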

groupBy is not an operation by itself. It must always be followed by a function that is applied to the grouped DataSet, such as reduce, reduceGroup (GroupReduce), or combineGroup (GroupCombine). The groupBy call defines how records are grouped before they are passed to that function. Grouping of records happens in two steps.

1. All records with the same grouping key must be moved to the same task instance. This is done by partitioning the data. Since there are usually more distinct grouping keys than task instances, a single task instance usually has to handle records of several distinct grouping keys.
2. Within each task instance, the records must be grouped on the key. This is usually done by sorting the data.
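The two steps can be sketched in plain Java as follows. This only mimics the idea and is not how Flink's runtime is implemented: the class GroupBySketch, its method names, the use of Java's hashCode for partitioning, and the choice of summing as the applied function are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: plain Java that mimics the two steps.
// All names here are hypothetical and chosen for illustration.
class GroupBySketch {

    // Step 1: move all records with the same key to the same
    // (simulated) task instance by hash partitioning on the key.
    static List<List<Map.Entry<String, Integer>>> partitionByKey(
            List<Map.Entry<String, Integer>> records, int parallelism) {
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> record : records) {
            int target = Math.floorMod(record.getKey().hashCode(), parallelism);
            partitions.get(target).add(record);
        }
        return partitions;
    }

    // Step 2: inside one instance, sort by key so that the records of a
    // group become adjacent, then apply the function (here: a sum) per group.
    static Map<String, Integer> sortAndSum(List<Map.Entry<String, Integer>> partition) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(partition);
        sorted.sort(Map.Entry.comparingByKey());
        Map<String, Integer> result = new TreeMap<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).getKey();
            int sum = 0;
            // Consume the run of records that share the current key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();
                i++;
            }
            result.put(key, sum);
        }
        return result;
    }

    // Run both steps and collect the per-key sums from all instances.
    static Map<String, Integer> groupAndSum(
            List<Map.Entry<String, Integer>> records, int parallelism) {
        Map<String, Integer> all = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : partitionByKey(records, parallelism)) {
            all.putAll(sortAndSum(partition));
        }
        return all;
    }
}
```

Because step 1 sends every record of a key to exactly one instance, no key can appear in the results of two instances, so merging the per-instance results is safe. For a DataSet of two-field tuples, the rough DataSet API counterpart of groupAndSum would be groupBy(0).sum(1).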

So, the first step of groupBy is partitioning.

Tags:

apache-flink

Ganesh P
