I have a PCollection, and I would like to use a ParDo to filter out some elements from it.
Is there a place where I can find an example for this?
ParDo is the core element-wise transform in Apache Beam, invoking a user-specified function on each of the elements of the input PCollection to produce zero or more output elements, all of which are collected into the output PCollection.
A transform for generic parallel processing. A ParDo transform considers each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero or more elements to an output PCollection. See more information in the Beam Programming Guide.
ParDo is the computational pattern of per-element computation. The DoFn, here called fn, is the logic that is applied to each element. ParDo takes an instance of a DoFn subclass as its argument.
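To make the "zero or more outputs per element" idea concrete, here is a plain-Python sketch of the pattern. It is not Beam code: simulate_par_do, keep_even, duplicate and drop_all are hypothetical names made up for illustration, and a real runner distributes this work rather than looping in a single process.

```python
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def simulate_par_do(
    process: Callable[[T], Optional[Iterable[U]]],
    elements: Iterable[T],
) -> Iterator[U]:
    """Hypothetical stand-in for ParDo: call process() once per element
    and flatten whatever each call produces into one output stream."""
    for element in elements:
        outputs = process(element)
        if outputs is None:  # this call produced nothing
            continue
        yield from outputs


def keep_even(element):
    # Zero outputs for odd elements, one output for even elements.
    if element % 2 == 0:
        yield element


def duplicate(element):
    # Two outputs per element.
    return [element, element]


def drop_all(element):
    # No output at all.
    return None


print(list(simulate_par_do(keep_even, [1, 2, 3, 4, 5])))  # [2, 4]
print(list(simulate_par_do(duplicate, [1, 2])))           # [1, 1, 2, 2]
print(list(simulate_par_do(drop_all, [1, 2, 3])))         # []
```

Filtering is just the special case where the per-element function emits either the element itself or nothing.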
In the Apache Beam Python SDK, there is a Filter transform that takes a callable (for example, a lambda) and drops every element for which it returns a falsy value. Here is an example:
# Assumes `import apache_beam as beam` and an existing beam.Pipeline named p.
filtered_collection = (p
                       | beam.Create([1, 2, 3, 4, 5])
                       | beam.Filter(lambda x: x % 2 == 0))
In this case, filtered_collection will be a PCollection that contains 2 and 4.
If you want to code this as a DoFn that is passed to a ParDo transform, you would do something like this:
class FilteringDoFn(beam.DoFn):
  def process(self, element):
    if element % 2 == 0:
      yield element
    else:
      return  # Return nothing
and you can apply it like so:
# Assumes `import apache_beam as beam` and an existing beam.Pipeline named p.
filtered_collection = (p
                       | beam.Create([1, 2, 3, 4, 5])
                       | beam.ParDo(FilteringDoFn()))
where, like before, filtered_collection is a PCollection that contains 2 and 4.