What magic does Flink use in distinct()? How are surrogate keys generated?

Question

To generate surrogate keys, the first step is to get the distinct elements and then assign an incremental key to each tuple.

I first used a Java Set to collect the distinct elements, but it ran out of heap space. Then I used Flink's distinct() and it worked without any problems.

What makes the difference?

A related question: can Flink generate surrogate keys in a mapper?

Fabian Hueske · Accepted Answer

Flink executes a distinct() internally as a GroupBy followed by a ReduceGroup operator, where the reduce operator returns the first element of the group only.

The GroupBy is done by sorting the data. Sorting is done on a binary data representation, if possible in-memory, but might spill to disk if not enough memory is available. This blog post gives some insight about that. GroupBy and Sort are memory-safe in Flink and will not fail with an OutOfMemoryError.
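To make the idea concrete, here is a minimal plain-Java sketch of a sort-based distinct that spills to disk. It is not Flink's implementation: Flink sorts a binary representation in managed memory, while this sketch sorts strings and writes text files. The class `SpillingDistinctSketch`, the parameter `maxInMemory`, and the temp-file naming are all assumptions made for illustration. It only shows why the approach stays within a fixed memory budget, where a Set has to hold every distinct element on the heap.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical illustration only; not Flink code. Elements must not contain line breaks.
final class SpillingDistinctSketch {

    // One sorted run on disk, ordered by the line currently at its head.
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    // Emits each distinct element once, holding at most maxInMemory elements in the sort buffer.
    static void distinct(Iterator<String> input, int maxInMemory, Consumer<String> out) {
        try {
            // Phase 1: fill the buffer, sort it, and spill it to disk as a sorted run.
            List<Path> runFiles = new ArrayList<>();
            List<String> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= maxInMemory) {
                    runFiles.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                runFiles.add(spill(buffer));
            }

            // Phase 2: merge the runs; equal elements arrive next to each other,
            // so only the first element of each group is emitted.
            PriorityQueue<Run> heap = new PriorityQueue<>();
            for (Path file : runFiles) {
                Run run = new Run(file);
                if (run.head != null) heap.add(run); else run.reader.close();
            }
            String previous = null;
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();
                if (!smallest.head.equals(previous)) {
                    previous = smallest.head;
                    out.accept(previous);
                }
                if (smallest.advance()) heap.add(smallest); else smallest.reader.close();
            }
            for (Path file : runFiles) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("distinct-run", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String s : buffer) {
                writer.write(s);
                writer.newLine();
            }
        }
        return file;
    }

    public static void main(String[] args) {
        // Buffer of 3 elements forces three runs on disk for this input.
        distinct(Arrays.asList("b", "a", "b", "c", "a", "c", "d").iterator(), 3, System.out::println);
    }
}
```

Memory use is bounded by the buffer size plus one line per run, regardless of how many distinct elements exist. The results are handed to a consumer instead of being collected, so the output does not have to fit in memory either.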

You can also do a distinct on a custom key, by using DataSet.distinct(KeySelector ks). The key selector is basically a MapFunction that generates a custom key.
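The semantics of a distinct on a custom key can be sketched in plain Java as follows. This is not the Flink API: `KeyedDistinctSketch` and `distinctBy` are hypothetical names, and `java.util.function.Function` stands in for Flink's KeySelector. The sketch sorts by the extracted key and keeps the first element of each key group, which mirrors the GroupBy plus "first element" behaviour described above, but it runs in memory on a single machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; not Flink code.
final class KeyedDistinctSketch {

    // Keeps one element per key. List.sort is stable, so the element kept
    // for each key is the one that appeared first in the input.
    static <T, K extends Comparable<K>> List<T> distinctBy(List<T> input, Function<T, K> keySelector) {
        List<T> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing(keySelector));

        List<T> result = new ArrayList<>();
        K previousKey = null;
        boolean first = true;
        for (T element : sorted) {
            K key = keySelector.apply(element);
            // A new key group starts whenever the key changes.
            if (first || key.compareTo(previousKey) != 0) {
                result.add(element);
                previousKey = key;
                first = false;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "apple", "blueberry", "avocado", "cherry");
        // Custom key: the first letter of each word.
        System.out.println(distinctBy(words, w -> w.substring(0, 1)));
        // prints [apple, banana, cherry]
    }
}
```

The key function decides what counts as a duplicate: two elements with different contents but the same key collapse into one result element.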

Tags:

apache-flink

HungUnicorn
