What is the purpose of org.apache.beam.sdk.transforms.Reshuffle? In the documentation the purpose is defined as:
A PTransform that returns a PCollection equivalent to its input but operationally provides some of the side effects of a GroupByKey, in particular preventing fusion of the surrounding transforms, checkpointing and deduplication by id.
What is the benefit of preventing fusion of the surrounding transforms? I thought fusion is an optimization to prevent unnecessarily steps. Actual use case would be helpful.
Adds a temporary random key to each element in a collection, reshuffles these keys, and removes the temporary key. This redistributes the elements between workers and returns a collection equivalent to its input collection. This is most useful for adjusting parallelism or preventing coupled failures.
Airflow shines in data orchestration and pipeline dependency management, while Beam is a unified tool for building big data pipelines, which can be executed in the most popular data processing systems such as Spark or Flink.
Each pipeline represents a single, repeatable job. A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline's data. Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline.
Apache Beam means a unified programming model. It implements batch and streaming data processing jobs that run on any execution engine. It executes pipelines in multiple execution environments. Apache Spark defines as a fast and general engine for large-scale data processing.
There are a couple cases when you may want to reshuffle your data. The following is not an exhaustive list, but should give you and idea about why you may reshuffle:
This means that the parallelism is increased after your ParDo. If you don't break the fusion here, your pipeline will not be able to split data into multiple machines to process it.
Consider the extreme case of a DoFn that generates a million output elements for every input element. Consider that this ParDo receives 10 elements in its input. If you don't break fusion between this high-fanout ParDo and its downstream transforms, it will only be able to run on 10 machines, although you will have millions of elements.
Imagine that your pipeline consumes 9 files of 10MB and one file of 10GB. If each file is read by a single machine, you will have one machine with a lot more data than the others.
If you don't reshuffle this data, most of your machines will be idle while your pipeline runs. Reshuffling it allows you to rebalance the data to be processed more evenly across machines.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With