Both DoFn and PTransform are ways to define an operation on a PCollection. How do we know which to use when?
DoFn is a Beam SDK class that describes a distributed processing function.
A PTransform<InputT, OutputT> is an operation that takes an InputT (some subtype of PInput) and produces an OutputT (some subtype of POutput). Common PTransforms include root PTransforms like TextIO.
PCollection - A PCollection is a data set or data stream. The data that a pipeline processes is part of a PCollection.
PTransform - A PTransform (or transform) represents a data processing operation, or a step, in your pipeline.
ParDo is the core parallel processing operation in the Apache Beam SDKs, invoking a user-specified function on each of the elements of the input PCollection. ParDo collects the zero or more output elements into an output PCollection. The ParDo transform processes elements independently and possibly in parallel.
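To make the "zero or more output elements" part concrete, here is a toy model in plain Java. It is only a sketch and does not use Beam: ElementFn, parDo and SPLIT_WORDS are hypothetical names invented for illustration, and a real runner would distribute the work rather than loop sequentially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model only: these names are hypothetical and are NOT Beam's API.
final class ParDoSketch {

    // Plays the role of a DoFn: per-element logic that may emit zero or more outputs.
    interface ElementFn<I, O> {
        void process(I element, Consumer<O> out);
    }

    // Plays the role of ParDo: applies fn to every element and collects the outputs.
    // A real runner may do this in parallel; this sketch is a plain sequential loop.
    static <I, O> List<O> parDo(List<I> input, ElementFn<I, O> fn) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            fn.process(element, output::add);
        }
        return output;
    }

    // Example logic: split a line into words, emitting nothing for blank lines.
    static final ElementFn<String, String> SPLIT_WORDS = (line, out) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.accept(word);
            }
        }
    };

    public static void main(String[] args) {
        List<String> lines = List.of("hello beam", "", "one two three");
        // The blank line contributes zero outputs; the others contribute several.
        System.out.println(parDo(lines, SPLIT_WORDS)); // [hello, beam, one, two, three]
    }
}
```

The point of the split is that SPLIT_WORDS knows nothing about how the collection is traversed, and parDo knows nothing about what happens to each element.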
A simple way to understand it is by analogy with map(f) for lists: map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern. f is the logic applied to each element.
Now, switching to talk about Beam specifics, I think you are asking about ParDo.of(fn), which is a PTransform.
PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.
ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
DoFn, here I called it fn, is the logic that is applied to each element.
It may also help to think of the fact that you write a DoFn to say what to do on each element, and the Beam runner provides the ParDo to apply your logic.
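The map(f) analogy can be sketched in plain Java, assuming ordinary lists stand in for PCollections. This is not Beam code: mapOf is a hypothetical helper that mirrors the shape of ParDo.of(fn) by taking per-element logic and handing back an operation on whole collections.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Analogy only: plain Java collections, not Beam. The names below are hypothetical.
final class MapAnalogy {

    // Shaped like ParDo.of(fn): takes per-element logic, returns a whole-collection operation.
    static <I, O> Function<List<I>, List<O>> mapOf(Function<I, O> f) {
        return list -> list.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // f is the per-element logic (the role a DoFn plays).
        Function<String, Integer> length = String::length;

        // mapOf(f) is the collection-level operation (the role a PTransform plays).
        Function<List<String>, List<Integer>> lengths = mapOf(length);

        System.out.println(lengths.apply(List.of("do", "fn", "beam"))); // [2, 2, 4]
    }
}
```

Read this way, the "which to use when" question mostly dissolves: you write the per-element function, and you wrap it in the pattern to get something you can apply to a collection.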