We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in significantly worse performance. Our evidence suggests that Dataflow silently performs another GroupByKey at the end. We've moved to letting Dataflow select the number of shards automatically (num_shards=0). However, for some pipelines this results in a huge number of relatively small output files (~15K files, each <1MB).
Is there any way to send hints to Dataflow about the expected size of the outputs so it can scale accordingly? We notice that this problem happens mostly when the input dataset is quite large and the output is much smaller.
We're using Apache Beam Python 2.2.
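For reference, a minimal sketch of where this setting lives in the Python SDK (the bucket paths are placeholders, not from the original pipeline): the num_shards argument to beam.io.WriteToText is the knob in question, and leaving it at 0 lets the runner decide.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*')   # placeholder input
     | 'Write' >> beam.io.WriteToText(
         'gs://my-bucket/output/result',                          # placeholder prefix
         num_shards=0))  # 0 (the default) lets the runner choose; a fixed value forces a reshard
```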
The Apache Beam programming model simplifies the mechanics of large-scale data processing. Using one of the Apache Beam SDKs, you build a program that defines the pipeline. Then, one of Apache Beam's supported distributed processing backends, such as Dataflow, executes the pipeline.
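As a rough illustration of that model, the sketch below defines a trivial pipeline with the Python SDK and hands it to the Dataflow runner via PipelineOptions; the project, bucket, and step names are placeholder assumptions, not taken from the question.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The pipeline is defined with the Beam SDK; the runner option picks the backend.
options = PipelineOptions(
    runner='DataflowRunner',             # 'DirectRunner' would execute locally instead
    project='my-gcp-project',            # placeholder project
    temp_location='gs://my-bucket/tmp')  # placeholder staging location

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['alpha', 'beta', 'gamma'])
     | 'Upper' >> beam.Map(lambda word: word.upper())
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/words'))  # placeholder output
```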
PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
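A short sketch of the two cases, assuming a GCS path for the bounded source and a Pub/Sub subscription for the unbounded one (both placeholders; the exact location of ReadFromPubSub differs across SDK versions):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded PCollection: a fixed, file-based source.
with beam.Pipeline() as p:
    lines = p | 'ReadFiles' >> beam.io.ReadFromText('gs://my-bucket/data/*.txt')  # placeholder

# Unbounded PCollection: a continuously updating Pub/Sub subscription (streaming mode).
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    messages = p | 'ReadPubSub' >> beam.io.ReadFromPubSub(
        subscription='projects/my-project/subscriptions/my-sub')  # placeholder
```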
Which of these accurately describes the relationship between Apache Beam and Cloud Dataflow? Apache Beam is the API for building data pipelines in Java or Python, and Cloud Dataflow is the implementation and execution framework.
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or Combine.globally() with a CombineFn), because a PCollection is not like a typical in-memory Collection in the Java SDK.
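The equivalent point in the Python SDK used by the question, as a hedged sketch: the count is itself a PCollection produced by a transform, not a number you can read while constructing the pipeline (the input path below is a placeholder).

```python
import apache_beam as beam

with beam.Pipeline() as p:
    size = (p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*')  # placeholder input
            | 'CountElements' >> beam.combiners.Count.Globally())
    # 'size' is itself a PCollection holding a single element (the count);
    # it is not an integer available while the pipeline is being built.
```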
This type of hinting is not supported in Dataflow / Apache Beam. In general, Dataflow and Apache Beam are designed to be as "no knobs" as possible.