Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Beam/Dataflow Reshuffle

What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:

A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey, in particular preventing fusion of the surrounding transforms, checkpointing and deduplication by id.

What is the benefit of preventing fusion of the surrounding transforms? I thought fusion is an optimization to prevent unnecessarily steps. Actual use case would be helpful.

like image 312
user_1357 Avatar asked Jan 10 '19 03:01

user_1357


People also ask

What does beam reshuffle do?

Adds a temporary random key to each element in a collection, reshuffles these keys, and removes the temporary key. This redistributes the elements between workers and returns a collection equivalent to its input collection. This is most useful for adjusting parallelism or preventing coupled failures.

What is the difference between Apache beam and airflow?

Airflow shines in data orchestration and pipeline dependency management, while Beam is a unified tool for building big data pipelines, which can be executed in the most popular data processing systems such as Spark or Flink.

What is PCollection in Apache beam?

Each pipeline represents a single, repeatable job. A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline's data. Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline.

What is Apache Beam vs spark?

Apache Beam means a unified programming model. It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines in multiple execution environments. Apache Spark defines as a fast and general engine for large-scale data processing.


1 Answers

There are a couple cases when you may want to reshuffle your data. The following is not an exhaustive list, but should give you and idea about why you may reshuffle:

When one of your ParDo transforms has a very high fanout

This means that the parallelism is increased after your ParDo. If you don't break the fusion here, your pipeline will not be able to split data into multiple machines to process it.

Consider the extreme case of a DoFn that generates a million output elements for every input element. Consider that this ParDo receives 10 elements in its input. If you don't break fusion between this high-fanout ParDo and its downstream transforms, it will only be able to run on 10 machines, although you will have millions of elements.

  • A good way to diagnose this is looking at the number of elements in an input PCollection vs the number of elements of an output PCollection. If the latter is significantly larger than the first, then you may want to consider adding a reshuffle.

When your data is not well balanced across machines**

Imagine that your pipeline consumes 9 files of 10MB and one file of 10GB. If each file is read by a single machine, you will have one machine with a lot more data than the others.

If you don't reshuffle this data, most of your machines will be idle while your pipeline runs. Reshuffling it allows you to rebalance the data to be processed more evenly across machines.

  • A good way to diagnose this is by looking at how many workers are executing work in your pipeline. If the pipeline is slow, and there is only one worker processing data, then you can benefit from a reshuffle.
like image 197
Pablo Avatar answered Dec 28 '22 14:12

Pablo