Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Beam: What is the difference between DoFn and SimpleFunction?

While reading about processing streaming elements in apache beam using Java, I came across DoFn<InputT, OutputT> and then across SimpleFunction<InputT, OutputT>.

Both of these look similar to me and I find it difficult to understand the difference.

Can someone explain the difference in layman terms?

like image 620
kaxil Avatar asked May 25 '18 09:05

kaxil


People also ask

What is DoFn in Apache Beam?

DoFn is a Beam SDK class that describes a distributed processing function.

What is ParDo and DoFn?

ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question. The DoFn , here I called it fn , is the logic that is applied to each element.

What is Splittable DoFn?

Splittable DoFn (SDF) is a generalization of DoFn that gives it the core capabilities of Source while retaining DoFn 's syntax, flexibility, modularity, and ease of coding. As a result, it becomes possible to develop more powerful IO connectors than before, with shorter, simpler, more reusable code.

What is PCollection and PTransform in dataflow?

A PCollection can contain either a bounded or unbounded number of elements. Bounded and unbounded PCollections are produced as the output of PTransforms (including root PTransforms like Read and Create ), and can be passed as the inputs of other PTransforms.


1 Answers

Conceptually you can think of SimpleFunction is a simple case of DoFn:

  • SimpleFunction<InputT, OutputT>:

    • simple input to output mapping function;
    • single input produces single output;
    • statically typed, you have to @Override the apply() method;
    • doesn't depend on computation context;
    • can't use Beam state APIs;
    • example use case: MapElements.via(simpleFunction) to convert/modify elements one by one, producing one output for each element;
  • DoFn<InputT, OutputT>:

    • executed with ParDo;
    • exposed to the context (timestamp, window pane, etc);
    • can consume side inputs;
    • can produce multiple outputs or no outputs at all;
    • can produce side outputs;
    • can use Beam's persistent state APIs;
    • dynamically typed;
    • example use case: read objects from a stream, filter, accumulate them, perform aggregations, convert them, and dispatch to different outputs;

You can find more specific examples and use cases for ParDos in the dev guide.

This part mentions the MapElements, which is the use case for SimpleFunctions

like image 57
Anton Avatar answered Oct 24 '22 01:10

Anton