
Apache Beam: DoFn vs PTransform

Both DoFn and PTransform are ways to define an operation on a PCollection. How do we know which one to use, and when?

user_1357 asked Dec 08 '17 01:12



1 Answer

A simple way to understand it is by analogy with map(f) for lists:

  • The higher-order function map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.
  • The function f is the logic applied to each element.

Now, switching to Beam specifics: I think you are asking about ParDo.of(fn), which is a PTransform.

  • A PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.
  • ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
  • The DoFn, which I called fn here, is the logic that is applied to each element.

It may also help to think of it this way: you write a DoFn to say what to do with each element, and the Beam runner provides the ParDo that applies your logic.
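To make the division of labour concrete, here is a toy model in plain Java. It is not the Beam API: ElementLogic, CollectionTransform and PerElement are hypothetical stand-ins for DoFn, PTransform and ParDo, and an in-memory List stands in for a PCollection. It also ignores everything a real runner adds, such as distribution and parallelism.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DoFn: the logic for ONE element.
interface ElementLogic<I, O> {
    O process(I element);
}

// Hypothetical stand-in for a PTransform: a whole collection in, a whole collection out.
interface CollectionTransform<I, O> {
    List<O> expand(List<I> input);
}

class PerElement {
    // Hypothetical stand-in for ParDo.of(fn): wraps per-element logic
    // into a collection-level transform.
    static <I, O> CollectionTransform<I, O> of(ElementLogic<I, O> fn) {
        return input -> {
            List<O> output = new ArrayList<>();
            for (I element : input) {
                output.add(fn.process(element));
            }
            return output;
        };
    }

    public static void main(String[] args) {
        ElementLogic<String, String> shout = s -> s.toUpperCase();
        CollectionTransform<String, String> transform = PerElement.of(shout);
        System.out.println(transform.expand(List.of("beam", "pardo"))); // prints [BEAM, PARDO]
    }
}
```

In this sketch, shout never sees the collection and transform never knows what happens to an element. That mirrors the answer's point: the per-element logic is what you write, and the wrapper is what applies it.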

Kenn Knowles answered Oct 23 '22 11:10