Best way to prevent fusion in Google Dataflow?

From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion

You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.

This is what I came up with in python - is this reasonable / is there a simpler way?

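import apache_beam as beam


def prevent_fuse(collection):
    # Round-trip through GroupByKey: Dataflow does not fuse across an aggregation.
    return (
        collection
        | beam.Map(lambda x: (x, 1))  # key each element by itself
        | beam.GroupByKey()
        # Emit the key once per grouped value so duplicates are preserved.
        | beam.FlatMap(lambda x: (x[0] for _ in x[1]))
        )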

EDIT, in response to Ben Chambers' question

We want to prevent fusion because we have a collection which generates a much larger collection, and we need parallelization across the larger collection. If it fuses, I only get one worker across the larger collection.

Maximilian asked Mar 08 '23 16:03


2 Answers

Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.

deepyaman answered Mar 20 '23 11:03


That should work. There are other ways, but they partly depend on what you are trying to do and why you want to prevent fusion. Keep in mind that fusion is an important optimization to improve the performance of your pipeline.

Could you elaborate on why you want to prevent fusion?

Ben Chambers answered Mar 20 '23 13:03