While reading about processing streaming elements in Apache Beam using Java, I came across DoFn<InputT, OutputT> and then SimpleFunction<InputT, OutputT>.
Both of these look similar to me and I find it difficult to understand the difference.
Can someone explain the difference in layman's terms?
DoFn is a Beam SDK class that describes a distributed processing function.
ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question. The DoFn, here I called it fn, is the logic that is applied to each element.
Splittable DoFn (SDF) is a generalization of DoFn that gives it the core capabilities of Source while retaining DoFn's syntax, flexibility, modularity, and ease of coding. As a result, it becomes possible to develop more powerful IO connectors than before, with shorter, simpler, more reusable code.
A PCollection can contain either a bounded or unbounded number of elements. Bounded and unbounded PCollections are produced as the output of PTransforms (including root PTransforms like Read and Create), and can be passed as the inputs of other PTransforms.
Conceptually, you can think of SimpleFunction as a simple case of DoFn:
SimpleFunction<InputT, OutputT>:
  - you @Override the apply() method;
  - you use it with MapElements.via(simpleFunction) to convert/modify elements one by one, producing one output for each element.
DoFn<InputT, OutputT>:
  - you apply it with ParDo.
You can find more specific examples and use cases for ParDo in the dev guide. That part also mentions MapElements, which is the use case for SimpleFunction.
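To make the contrast concrete, here is a minimal sketch in plain Java, with no Beam dependency. The names FnShapes, mapEach, ElementProcessor and processEach are hypothetical stand-ins invented for this illustration and are not part of the Beam API. mapEach mimics the SimpleFunction shape, where each input yields exactly one output. processEach mimics the DoFn shape, where the per-element logic is handed an emitter and may output zero, one, or several values. In Beam itself the corresponding entry points are MapElements.via(...) and ParDo.of(...).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class FnShapes {

    // Stand-in for the SimpleFunction shape (hypothetical, not Beam API):
    // the function returns a value, so every input gives exactly one output.
    public static <I, O> List<O> mapEach(List<I> input, Function<I, O> fn) {
        List<O> out = new ArrayList<>();
        for (I element : input) {
            out.add(fn.apply(element));
        }
        return out;
    }

    // Stand-in for the DoFn shape (hypothetical, not Beam API):
    // the logic receives an emitter and decides how many outputs to produce.
    public interface ElementProcessor<I, O> {
        void process(I element, Consumer<O> emit);
    }

    public static <I, O> List<O> processEach(List<I> input, ElementProcessor<I, O> fn) {
        List<O> out = new ArrayList<>();
        for (I element : input) {
            fn.process(element, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a b", "", "c");

        // One output per input: the length of each line.
        List<Integer> lengths = mapEach(lines, String::length);

        // Zero or more outputs per input: the non-empty words of each line.
        List<String> words = processEach(lines, (line, emit) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    emit.accept(word);
                }
            }
        });

        System.out.println(lengths); // [3, 0, 1]
        System.out.println(words);   // [a, b, c]
    }
}
```

The three input lines always produce three lengths, while the word splitter produces two words for the first line, none for the empty line, and one for the last. If the logic is a pure one-in, one-out conversion, the simpler shape is enough. If it needs to drop elements or fan out, the emitter-based shape is the one to reach for.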