Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between shuffle() and rebalance() in Apache Flink

I am working on my bachelor's final project, which is about the comparison between Apache Spark Streaming and Apache Flink (only streaming) and I have just arrived to "Physical partitioning" in Flink's documentation. The matter is that in this documentation it doesn't explain well how this two transformations work. Directly from the documentation:

shuffle(): Partitions elements randomly according to a uniform distribution.

rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.

Source: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning

Both are automatically done, so what I understand is that they both redistribute equally (shuffle() > uniform distribution & rebalance() > round-robin) and randomly the data. Then I deduce that rebalance() distributes the data in a better way ("equal load per partitions") so the tasks have to process the same amount of data, but shuffle() may create bigger and smaller partitions. Then, in which cases might you prefer to use shuffle() than rebalance()?

The only thing that comes to my mind is that probably rebalance()requires some processing time so in some cases it might use more time to do the rebalancing than the time it will improve in the future transformations.

I have been looking for this and nobody has talked about this, only in a mailing list of Flink, but they don't explain how shuffle() works.

Thanks to Sneftel who has helped me to improve my question asking me things to let me rethink about what I wanted to ask; and to Till who answered quite well my question. :D

like image 221
froblesmartin Avatar asked May 13 '17 18:05

froblesmartin


People also ask

What is KeyBy in Flink?

According to the Apache Flink documentation, KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.

What is sink in Flink?

This connector provides a Sink that writes partitioned files to filesystems supported by the Flink FileSystem abstraction. The streaming file sink writes incoming data into buckets. Given that the incoming streams can be unbounded, data in each bucket are organized into part files of finite size.

What is keyed stream in Flink?

Using keyed streams - Flink Tutorial Flink users are hashing algorithms to divide the stream by partitions based on the number of slots allocated to the job. It then distributes the same keys to the same slots. Partitioning by key is ideal for aggregation operations that aggregate on a specific key.

Can Flink be used for batch processing?

Flink is a framework and distributed processing engine for batch and stream data processing. Its structure enables it to process a finite amount of data and infinite streams of data. Flink provides data source/sink connectors to read data from and write data to external systems like Kafka, HDFS, Cassandra etc.


1 Answers

As the documentation states, shuffle will randomly distribute the data whereas rebalance will distribute the data in a round robin fashion. The latter is more efficient since you don't have to compute a random number. Moreover, depending on the randomness, you might end up with some kind of not so uniform distribution.

On the other hand, rebalance will always start sending the first element to the first channel. Thus, if you have only few elements (fewer elements than subtasks), then only some of the subtasks will receive elements, because you always start to send the first element to the first subtask. In the streaming case this should eventually not matter because you usually have an unbounded input stream.

The actual reason why both methods exist is a historically reason. shuffle was introduced first. In order to make the batch an streaming API more similar, rebalance was then introduced.

like image 147
Till Rohrmann Avatar answered Oct 28 '22 00:10

Till Rohrmann