Side output in ParDo | Apache Beam Python SDK

Tags: google-cloud-dataflow

As the documentation is only available for Java, I could not really understand what it means.

It states - "While ParDo always produces a main output PCollection (as the return value from apply), you can also have your ParDo produce any number of additional output PCollections. If you choose to have multiple outputs, your ParDo will return all of the output PCollections (including the main output) bundled together. For example, in Java, the output PCollections are bundled in a type-safe PCollectionTuple."

I understand what "bundled together" means, but if I yield a tagged value in my DoFn, does it emit a bundle immediately, with all the other outputs empty, and then emit the other outputs as they are encountered in the code? Or does it wait until all yields for an input are ready and then output them all together in a bundle?

The documentation is not very clear on this. I think it does not wait and simply yields each value as it is encountered, but I still need to understand what is actually happening.

asked Sep 14 '18 20:09

IYY

1 Answer

The best way to answer this is with an example. This example is available in Beam.

Suppose that you want to run a word count pipeline (i.e. count the number of times each word appears in a document). For this you need to split the lines of a file into individual words. Suppose you also want to record the character count of each line as a separate output. Your splitting transform would look like this:

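import apache_beam as beam
from apache_beam.io import ReadFromText

# pipeline_options and known_args are assumed to be defined earlier
# (e.g. parsed from the command line).
with beam.Pipeline(options=pipeline_options) as p:

    lines = p | ReadFromText(known_args.input)  # Read in the file

    # with_outputs allows accessing the explicitly tagged outputs of a DoFn.
    split_lines_result = (lines
                          | beam.ParDo(SplitLinesToWordsFn()).with_outputs(
                              SplitLinesToWordsFn.OUTPUT_TAG_CHARACTER_COUNT,
                              main='words'))

    # The main (untagged) output is available under the name given to main=.
    words = split_lines_result['words']
    character_count = split_lines_result[
        SplitLinesToWordsFn.OUTPUT_TAG_CHARACTER_COUNT]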

In this case, each entry of split_lines_result is a separate PCollection containing only the elements routed to it. The DoFn is responsible for splitting its outputs, and it does so by tagging elements:

import re

import apache_beam as beam
from apache_beam import pvalue


class SplitLinesToWordsFn(beam.DoFn):
  OUTPUT_TAG_CHARACTER_COUNT = 'tag_character_count'

  def process(self, element):
    # Yield the line's character count (an integer) to the
    # OUTPUT_TAG_CHARACTER_COUNT tagged collection.
    yield pvalue.TaggedOutput(
        self.OUTPUT_TAG_CHARACTER_COUNT, len(element))

    words = re.findall(r'[A-Za-z\']+', element)
    for word in words:
      # Yield each word untagged to add it to the main collection.
      yield word

As you can see, for the main output, you do not need to tag the elements, but for the other outputs you do.
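
On the timing part of the question: as far as I understand the Python SDK, each value is handed to its output as soon as process yields it, and nothing waits for the other outputs to have values for that input. The "bundled together" wording describes the object returned by with_outputs, which gives access to all the output PCollections, not a per-element packet. The sketch below is a conceptual model in plain Python, not Beam's implementation; Tagged and route are hypothetical stand-ins so that it runs without Beam installed.

```python
import re
from collections import defaultdict, namedtuple

# Hypothetical stand-in for apache_beam.pvalue.TaggedOutput.
Tagged = namedtuple("Tagged", ["tag", "value"])


def process(line):
    """Same logic as SplitLinesToWordsFn.process, using the stand-in type."""
    # Tagged value: goes to the additional output named by the tag.
    yield Tagged("tag_character_count", len(line))
    # Untagged values: go to the main output.
    for word in re.findall(r"[A-Za-z']+", line):
        yield word


def route(lines, main_tag="words"):
    """Hypothetical dispatcher: routes each yielded item immediately."""
    outputs = defaultdict(list)
    for line in lines:
        for item in process(line):
            if isinstance(item, Tagged):
                outputs[item.tag].append(item.value)
            else:
                outputs[main_tag].append(item)
    return dict(outputs)


result = route(["hello world", "it's beam"])
print(result["words"])
print(result["tag_character_count"])
```

In this model every item is routed on its own, so an input that yields only words (or only a count) simply contributes nothing to the other output.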

answered Nov 02 '22 08:11

Pablo
