Controlling Dataflow/Apache Beam output sharding

We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow secretly does another GroupByKey at the end. We've moved to letting Dataflow select the number of shards automatically (num_shards=0). However, for some pipelines this results in a huge number of relatively small output files (~15K files, each <1MB).

Is there any way to send hints to Dataflow about the expected size of the outputs so it can scale accordingly? We notice that this problem happens mostly when the input dataset is quite large and the output is much smaller.

We're using the Apache Beam Python SDK, version 2.2.
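
For concreteness, here is a minimal sketch of the two configurations described above, assuming a beam.io.WriteToText sink; the bucket paths and the shard count are placeholders:

    import apache_beam as beam

    with beam.Pipeline() as p:
        records = p | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input*')

        # Explicit shard count: produces exactly num_shards output files,
        # but (per the observation above) appears to trigger an extra
        # grouping step and hurts performance.
        records | 'WriteFixed' >> beam.io.WriteToText(
            'gs://my-bucket/output-fixed', num_shards=100)

        # Runner-chosen sharding (num_shards=0, the default): Dataflow picks
        # the shard count, which can yield many small files.
        records | 'WriteAuto' >> beam.io.WriteToText('gs://my-bucket/output-auto')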

Josh Sacks asked Mar 27 '18

People also ask

Does Dataflow use Apache Beam?

The Apache Beam programming model simplifies the mechanics of large-scale data processing. Using one of the Apache Beam SDKs, you build a program that defines the pipeline. Then, one of Apache Beam's supported distributed processing backends, such as Dataflow, executes the pipeline.

What is PCollection?

A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source such as a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
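
As an illustrative sketch, a bounded PCollection can be built from an in-memory list in the Python SDK; real pipelines would typically read from a source such as text files or Pub/Sub instead:

    import apache_beam as beam

    with beam.Pipeline() as p:
        # 'words' is a bounded PCollection built from a fixed in-memory list.
        words = p | 'Create' >> beam.Create(['apache', 'beam', 'dataflow'])
        words | 'Upper' >> beam.Map(lambda w: w.upper())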

Which of these accurately describes the relationship between Apache Beam and cloud dataflow?

Apache Beam is the API for building data pipelines in Java or Python, and Cloud Dataflow is the implementation and execution framework.

How do I know if PCollection is empty?

There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or Combine.globally()), because a PCollection is not like a typical in-memory Collection in the Java SDK.
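
For example, a minimal sketch in the Python SDK: the count is itself a single-element PCollection, so it can only be inspected by downstream transforms, never read directly in the driver program.

    import apache_beam as beam

    with beam.Pipeline() as p:
        items = p | 'Create' >> beam.Create([1, 2, 3])
        # This filtered collection may turn out to be empty.
        filtered = items | 'Filter' >> beam.Filter(lambda x: x > 10)

        # Count.Globally() yields a PCollection containing a single count
        # (0 if 'filtered' is empty); it must be consumed by another transform.
        size = filtered | 'Count' >> beam.combiners.Count.Globally()
        size | 'Flag' >> beam.Map(lambda n: 'empty' if n == 0 else 'non-empty')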


1 Answer

This type of hinting is not supported in Dataflow / Apache Beam. In general, Dataflow and Apache Beam are designed to be as "no knobs" as possible, for a couple of reasons:

  1. To allow the Dataflow service to intelligently make optimization decisions on its own. Dataflow has smart autoscaling capabilities which can scale the number of worker VMs up or down according to the current workload.
  2. To ensure that pipelines written with the Apache Beam SDK are portable across runners (such as Dataflow, Spark, or Flink). Pipeline logic is written in terms of a set of abstractions so that the same job can run in a variety of environments, and each runner may apply its own set of optimizations to these high-level abstractions (see the sketch after this list).
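
As a sketch of that portability, the same pipeline code can target different runners purely through its pipeline options; the runner name, project, and bucket below are placeholder values:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Swapping '--runner=DataflowRunner' for DirectRunner, SparkRunner, or
    # FlinkRunner changes where the job executes, not how it is written.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',           # placeholder
        '--temp_location=gs://my-bucket/tmp', # placeholder
    ])

    with beam.Pipeline(options=options) as p:
        p | 'Create' >> beam.Create([1, 2, 3]) | 'Double' >> beam.Map(lambda x: x * 2)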
Scott Wegner answered Oct 09 '22