Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Flink : How to process and output two datasets in a single transformation?

Tags:

apache-flink

The join and coGroup transformation can read 2 input datasets and output one ("Y" flux) (correct me if I'm wrong).

I would like to process and update 2 datasets. To do this, I plan to use 2 coGroup transformations.

But, for performance purpose, can these both transformations be done in a single one ("H" flux)?

Also, as the datasets are updated, I would like to iterate over them. If it's not currently possible, do you plan to support this kind of transformation in the future?

like image 855
Ghislain Viguier Avatar asked Aug 18 '15 08:08

Ghislain Viguier


People also ask

Which of the following transformation on datasets are supported by Apache Flink?

It can apply different kinds of transformations on the datasets like filtering, mapping, aggregating, joining and grouping.

What is KeyBy in Flink?

According to the Apache Flink documentation, KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.

What is sink in Flink?

This connector provides a Sink that writes partitioned files to filesystems supported by the Flink FileSystem abstraction. The streaming file sink writes incoming data into buckets. Given that the incoming streams can be unbounded, data in each bucket are organized into part files of finite size.

What is keyed stream in Flink?

Using keyed streams - Flink TutorialFlink distributes the events in a data stream to different task slots based on the key. Flink users are hashing algorithms to divide the stream by partitions based on the number of slots allocated to the job. It then distributes the same keys to the same slots.


1 Answers

All Flink DataSet operators support only a single output, but the output of an operator can be consumed by two or more following operators.

There are two ways to solve your issue:

  1. Use a single CoGroup to compute the result for both outputs and add two Filters that filter out the records of one of both outputs. If both outputs have different data types, you need to compute return something like Tuple2<FirstType, SecondType>. This solution would look like:
    input1--\         /--> Filter_output1 
              CoGroup 
    input2--/         \--> Filter_output2
  1. Partition and sort both CoGroup inputs on the grouping key and call two individual CoGroups. Each CoGroup computes one output. By sorting the data before the CoGroup, the partitioning and sorting can be reused. Important, all operators must use the same parallelism!
    input1 --> PartitionHash --> SortPartition -\-/-> CoGroup1 --> Output1
                                                 X
    input2 --> PartitionHash --> SortPartition -/-\-> CoGroup2 --> Output2

Regarding the iterations, have a look at Flink's iteration operators.

like image 175
Fabian Hueske Avatar answered Dec 03 '22 05:12

Fabian Hueske