I want to understand in which scenarios I should use FlatMap and in which I should use Map; the documentation did not make this clear to me.
Could someone give me an example so I can understand the difference between them?
I understand the difference between flatMap and map in Spark, but I am not sure whether the Beam transforms behave the same way.
The Pydoc for FlatMap says: "Applies a simple 1-to-many mapping function over each element in the collection. The many elements are flattened into the resulting collection."
In Scala, the flatMap() method works like the map() method, except that flatMap removes the inner grouping of each item and produces a flat sequence. It can be described as a blend of the map method and the flatten method.
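To see the "map, then flatten" idea outside of any framework, here is a minimal plain-Python sketch; the flat_map helper name is my own for illustration, not a Beam or Spark API:

from itertools import chain

def flat_map(fn, iterable):
    # Apply fn to each element (the map step), then flatten the
    # resulting inner iterables into one sequence (the flatten step).
    return list(chain.from_iterable(fn(x) for x in iterable))

print(list(map(lambda x: [x, x * 10], [1, 2, 3])))
# [[1, 10], [2, 20], [3, 30]]  -- one inner list per input element

print(flat_map(lambda x: [x, x * 10], [1, 2, 3]))
# [1, 10, 2, 20, 3, 30]        -- the inner lists are flattened away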
PCollection : A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
DoFn is a Beam SDK class that describes a distributed processing function.
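As a rough illustration of both terms, here is a minimal Beam sketch that builds a bounded PCollection with Create and applies a DoFn through ParDo; the SplitWords class and the sample strings are made up for this example:

import apache_beam as beam

class SplitWords(beam.DoFn):
    # process() may yield zero or more output elements per input element.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (p
     | beam.Create(['hello world', 'beam'])  # bounded PCollection of 2 elements
     | beam.ParDo(SplitWords())              # PCollection of 3 elements
     | beam.Map(print))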
These transforms behave in Beam exactly as they do in Spark (and in Scala collections).
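If it helps to see the Spark side of that comparison, a rough PySpark equivalent (assuming a local Spark installation; the element values mirror the Beam example below) behaves the same way:

from pyspark import SparkContext

sc = SparkContext('local', 'map-vs-flatmap')
rdd = sc.parallelize([1, 2, 3])

print(rdd.map(lambda x: [x, 'any']).collect())
# [[1, 'any'], [2, 'any'], [3, 'any']]  -- 3 elements, each a list

print(rdd.flatMap(lambda x: [x, 'any']).collect())
# [1, 'any', 2, 'any', 3, 'any']        -- 6 flattened elements

sc.stop()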
A Map transform maps a PCollection of N elements into another PCollection of N elements.

A FlatMap transform maps a PCollection of N elements into N collections of zero or more elements, which are then flattened into a single PCollection.
As a simple example, the following happens:
beam.Create([1, 2, 3]) | beam.Map(lambda x: [x, 'any'])
# The result is a collection of THREE lists:
# [[1, 'any'], [2, 'any'], [3, 'any']]
Whereas:
beam.Create([1, 2, 3]) | beam.FlatMap(lambda x: [x, 'any'])
# The lists output by the lambda are then flattened into a
# collection of SIX single elements:
# [1, 'any', 2, 'any', 3, 'any']
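Putting the two snippets into complete pipelines (run locally with the DirectRunner that the Python SDK uses by default) makes the difference easy to print and compare:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'CreateForMap' >> beam.Create([1, 2, 3])
     | 'Map' >> beam.Map(lambda x: [x, 'any'])
     | 'PrintMap' >> beam.Map(print))      # prints 3 lists

with beam.Pipeline() as p:
    (p
     | 'CreateForFlatMap' >> beam.Create([1, 2, 3])
     | 'FlatMap' >> beam.FlatMap(lambda x: [x, 'any'])
     | 'PrintFlatMap' >> beam.Map(print))  # prints 6 single elements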