I want to understand in which scenarios I should use FlatMap and in which I should use Map; the documentation did not make this clear to me.
Could someone give me an example so I can understand the difference between them?
I understand the difference between flatMap and map in Spark, but I am not sure whether the Beam transforms behave the same way.
The Pydoc for FlatMap says: "Applies a simple 1-to-many mapping function over each element in the collection. The many elements are flattened into the resulting collection."
In Scala, the flatMap() method works like the map() method, except that flatMap removes the inner grouping of each item and produces a flat sequence. It can be described as a blend of the map method and the flatten method.
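To see the "map, then flatten" idea outside of any framework, here is a minimal plain-Python sketch; the flat_map helper name is my own for illustration, not a Beam or Spark API:

from itertools import chain

def flat_map(fn, iterable):
    # Apply fn to each element (the map step), then flatten the
    # resulting inner iterables into one sequence (the flatten step).
    return list(chain.from_iterable(fn(x) for x in iterable))

print(list(map(lambda x: [x, x * 10], [1, 2, 3])))
# [[1, 10], [2, 20], [3, 30]]  -- one inner list per input element

print(flat_map(lambda x: [x, x * 10], [1, 2, 3]))
# [1, 10, 2, 20, 3, 30]        -- the inner lists are flattened away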
PCollection : A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
DoFn is a Beam SDK class that describes a distributed processing function.
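As a rough illustration of both terms, here is a minimal Beam sketch that builds a bounded PCollection with Create and applies a DoFn through ParDo; the SplitWords class and the sample strings are made up for this example:

import apache_beam as beam

class SplitWords(beam.DoFn):
    # process() may yield zero or more output elements per input element.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (p
     | beam.Create(['hello world', 'beam'])  # bounded PCollection of 2 elements
     | beam.ParDo(SplitWords())              # PCollection of 3 elements
     | beam.Map(print))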
These transforms behave in Beam exactly as they do in Spark (and in Scala collections).
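If it helps to see the Spark side of that comparison, a rough PySpark equivalent (assuming a local Spark installation; the element values mirror the Beam example below) behaves the same way:

from pyspark import SparkContext

sc = SparkContext('local', 'map-vs-flatmap')
rdd = sc.parallelize([1, 2, 3])

print(rdd.map(lambda x: [x, 'any']).collect())
# [[1, 'any'], [2, 'any'], [3, 'any']]  -- 3 elements, each a list

print(rdd.flatMap(lambda x: [x, 'any']).collect())
# [1, 'any', 2, 'any', 3, 'any']        -- 6 flattened elements

sc.stop()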
A Map transform maps a PCollection of N elements into another PCollection of N elements.

A FlatMap transform maps a PCollection of N elements into N collections of zero or more elements, which are then flattened into a single PCollection.
As a simple example, the following happens:
beam.Create([1, 2, 3]) | beam.Map(lambda x: [x, 'any'])
# The result is a collection of THREE lists:
# [[1, 'any'], [2, 'any'], [3, 'any']]
Whereas:
beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: [x, 'any'])
# The lists output by the lambda are then flattened into a
# collection of SIX single elements:
# [1, 'any', 2, 'any', 3, 'any']
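Putting the two snippets into complete pipelines (run locally with the DirectRunner that the Python SDK uses by default) makes the difference easy to print and compare:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'CreateForMap' >> beam.Create([1, 2, 3])
     | 'Map' >> beam.Map(lambda x: [x, 'any'])
     | 'PrintMap' >> beam.Map(print))      # prints 3 lists

with beam.Pipeline() as p:
    (p
     | 'CreateForFlatMap' >> beam.Create([1, 2, 3])
     | 'FlatMap' >> beam.FlatMap(lambda x: [x, 'any'])
     | 'PrintFlatMap' >> beam.Map(print))  # prints 6 single elements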