We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in significantly worse performance. Our evidence suggests that Dataflow silently performs another GroupByKey at the end. We've moved to letting Dataflow select the number of shards automatically (num_shards=0). However, for some pipelines this results in a huge number of relatively small output files (~15K files, each <1MB).
Is there any way to send hints to Dataflow about the expected size of the outputs so it can scale accordingly? We notice that this problem happens mostly when the input dataset is quite large and the output is much smaller.
We're using Apache Beam Python 2.2.
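For reference, a minimal sketch of where this setting lives in the Python SDK (the bucket paths are placeholders, not from the original pipeline): the num_shards argument to beam.io.WriteToText is the knob in question, and leaving it at 0 lets the runner decide.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*')   # placeholder input
     | 'Write' >> beam.io.WriteToText(
         'gs://my-bucket/output/result',                          # placeholder prefix
         num_shards=0))  # 0 (the default) lets the runner choose; a fixed value forces a reshard
```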
The Apache Beam programming model simplifies the mechanics of large-scale data processing. Using one of the Apache Beam SDKs, you build a program that defines the pipeline. Then, one of Apache Beam's supported distributed processing backends, such as Dataflow, executes the pipeline.
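As a rough illustration of that model, the sketch below defines a trivial pipeline with the Python SDK and hands it to the Dataflow runner via PipelineOptions; the project, bucket, and step names are placeholder assumptions, not taken from the question.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The pipeline is defined with the Beam SDK; the runner option picks the backend.
options = PipelineOptions(
    runner='DataflowRunner',             # 'DirectRunner' would execute locally instead
    project='my-gcp-project',            # placeholder project
    temp_location='gs://my-bucket/tmp')  # placeholder staging location

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['alpha', 'beta', 'gamma'])
     | 'Upper' >> beam.Map(lambda word: word.upper())
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/words'))  # placeholder output
```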
PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
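A short sketch of the two cases, assuming a GCS path for the bounded source and a Pub/Sub subscription for the unbounded one (both placeholders; the exact location of ReadFromPubSub differs across SDK versions):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded PCollection: a fixed, file-based source.
with beam.Pipeline() as p:
    lines = p | 'ReadFiles' >> beam.io.ReadFromText('gs://my-bucket/data/*.txt')  # placeholder

# Unbounded PCollection: a continuously updating Pub/Sub subscription (streaming mode).
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    messages = p | 'ReadPubSub' >> beam.io.ReadFromPubSub(
        subscription='projects/my-project/subscriptions/my-sub')  # placeholder
```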
Which of these accurately describes the relationship between Apache Beam and Cloud Dataflow? Apache Beam is the API for building data pipelines in Java or Python, and Cloud Dataflow is the implementation and execution framework.
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or Combine.globally() with a CombineFn), because a PCollection is not like a typical in-memory Collection in the Java SDK.
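The equivalent point in the Python SDK used by the question, as a hedged sketch: the count is itself a PCollection produced by a transform, not a number you can read while constructing the pipeline (the input path below is a placeholder).

```python
import apache_beam as beam

with beam.Pipeline() as p:
    size = (p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*')  # placeholder input
            | 'CountElements' >> beam.combiners.Count.Globally())
    # 'size' is itself a PCollection holding a single element (the count);
    # it is not an integer available while the pipeline is being built.
```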
This type of hinting is not supported in Dataflow / Apache Beam. In general, Dataflow and Apache Beam are designed to be as "no knobs" as possible.