The join and coGroup transformations each read two input datasets and produce one output (a "Y" flow); correct me if I'm wrong. I would like to process and update two datasets. To do this, I plan to use two coGroup transformations. For performance reasons, can both transformations be combined into a single one (an "H" flow)? Also, since the datasets are updated, I would like to iterate over them. If this is not currently possible, do you plan to support this kind of transformation in the future?


Flink : How to process and output two datasets in a single transformation?

1 Answer

All Flink DataSet operators support only a single output, but the output of an operator can be consumed by two or more following operators.

There are two ways to solve your issue:

Use a single CoGroup to compute the records for both outputs and add two Filters, each of which keeps only the records belonging to its own output. If the two outputs have different data types, the CoGroup needs to return something like Tuple2<FirstType, SecondType>, with only the field for the intended output set. This solution would look like:

    input1--\         /--> Filter_output1 
              CoGroup 
    input2--/         \--> Filter_output2
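As a rough illustration of the first option, here is a plain-Java model of the data flow. It does not use the Flink API: in a real job `coGroupBoth` would be a `CoGroupFunction` and the two extraction steps would be `filter` calls on the CoGroup result. The class name, the `Tagged` type (a stand-in for `Tuple2<FirstType, SecondType>`), and the per-key logic (a sum for output 1, a record count for output 2) are assumptions made up for this sketch.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class TwoOutputCoGroupSketch {

    // Stand-in for Tuple2<FirstType, SecondType>: exactly one field is non-null.
    static final class Tagged {
        final String first;   // record for output 1, or null
        final Integer second; // record for output 2, or null
        Tagged(String first, Integer second) { this.first = first; this.second = second; }
    }

    static SimpleEntry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Groups (key, value) pairs by key, as a runtime would before calling a coGroup function.
    static Map<String, List<Integer>> groupByKey(List<SimpleEntry<String, Integer>> input) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (SimpleEntry<String, Integer> e : input) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }

    // One pass over both inputs that emits records for BOTH outputs, tagged by field.
    static List<Tagged> coGroupBoth(List<SimpleEntry<String, Integer>> input1,
                                    List<SimpleEntry<String, Integer>> input2) {
        Map<String, List<Integer>> g1 = groupByKey(input1);
        Map<String, List<Integer>> g2 = groupByKey(input2);
        TreeSet<String> keys = new TreeSet<>(g1.keySet());
        keys.addAll(g2.keySet());

        List<Tagged> out = new ArrayList<>();
        for (String key : keys) {
            List<Integer> left = g1.getOrDefault(key, Collections.<Integer>emptyList());
            List<Integer> right = g2.getOrDefault(key, Collections.<Integer>emptyList());
            int sum = 0;
            for (int v : left) sum += v;
            for (int v : right) sum += v;
            out.add(new Tagged(key + "=" + sum, null));            // destined for output 1
            out.add(new Tagged(null, left.size() + right.size())); // destined for output 2
        }
        return out;
    }

    // Filter_output1: keep records tagged for output 1, then unwrap the field.
    static List<String> output1(List<Tagged> tagged) {
        return tagged.stream().filter(t -> t.first != null).map(t -> t.first)
                .collect(Collectors.toList());
    }

    // Filter_output2: keep records tagged for output 2, then unwrap the field.
    static List<Integer> output2(List<Tagged> tagged) {
        return tagged.stream().filter(t -> t.second != null).map(t -> t.second)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Tagged> tagged = coGroupBoth(
                Arrays.asList(pair("a", 1), pair("b", 2)),
                Arrays.asList(pair("a", 10), pair("c", 3)));
        System.out.println(output1(tagged)); // [a=11, b=2, c=3]
        System.out.println(output2(tagged)); // [2, 1, 1]
    }
}
```

The trade-off in this shape is that every record travels to both filters, so each filter discards roughly half of what it receives.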

Partition and sort both CoGroup inputs on the grouping key and call two individual CoGroups. Each CoGroup computes one output. Because the data is partitioned and sorted before the CoGroups, both of them can reuse that partitioning and sort order instead of repeating the work. Important: all operators must use the same parallelism!

    input1 --> PartitionHash --> SortPartition -\-/-> CoGroup1 --> Output1
                                                 X
    input2 --> PartitionHash --> SortPartition -/-\-> CoGroup2 --> Output2
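The idea behind the second option can be modelled in plain Java for a single partition: sort each input by key once, then let two merge-based coGroups walk the same sorted data. This is only a sketch of the principle, not Flink code. In a DataSet program the corresponding steps would be `partitionByHash` and `sortPartition` followed by two `coGroup` calls, and it is Flink's optimizer that decides to reuse the existing order. The class name and the per-key logic of `coGroup1` and `coGroup2` are invented for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedSortCoGroupSketch {

    // User logic applied once per key to the values from both inputs.
    interface GroupFn<R> {
        R apply(String key, List<Integer> left, List<Integer> right);
    }

    static SimpleEntry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Stand-in for PartitionHash + SortPartition on one partition: sort by key once.
    static List<SimpleEntry<String, Integer>> sortByKey(List<SimpleEntry<String, Integer>> input) {
        List<SimpleEntry<String, Integer>> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> a.getKey().compareTo(b.getKey()));
        return sorted;
    }

    // Merge-based coGroup: expects both inputs already sorted by key and never re-sorts.
    static <R> List<R> mergeCoGroup(List<SimpleEntry<String, Integer>> left,
                                    List<SimpleEntry<String, Integer>> right,
                                    GroupFn<R> fn) {
        List<R> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() || j < right.size()) {
            // The next key is the smaller of the two current keys.
            String key;
            if (j >= right.size()) {
                key = left.get(i).getKey();
            } else if (i >= left.size()) {
                key = right.get(j).getKey();
            } else {
                String lk = left.get(i).getKey();
                String rk = right.get(j).getKey();
                key = lk.compareTo(rk) <= 0 ? lk : rk;
            }
            List<Integer> lv = new ArrayList<>();
            while (i < left.size() && left.get(i).getKey().equals(key)) lv.add(left.get(i++).getValue());
            List<Integer> rv = new ArrayList<>();
            while (j < right.size() && right.get(j).getKey().equals(key)) rv.add(right.get(j++).getValue());
            out.add(fn.apply(key, lv, rv));
        }
        return out;
    }

    // CoGroup1 (hypothetical logic): sum of all values per key, for Output1.
    static List<String> coGroup1(List<SimpleEntry<String, Integer>> l, List<SimpleEntry<String, Integer>> r) {
        return mergeCoGroup(l, r, (key, lv, rv) -> {
            int sum = 0;
            for (int v : lv) sum += v;
            for (int v : rv) sum += v;
            return key + "=" + sum;
        });
    }

    // CoGroup2 (hypothetical logic): number of records per key, for Output2.
    static List<Integer> coGroup2(List<SimpleEntry<String, Integer>> l, List<SimpleEntry<String, Integer>> r) {
        return mergeCoGroup(l, r, (key, lv, rv) -> lv.size() + rv.size());
    }

    public static void main(String[] args) {
        // Each input is sorted exactly once; both coGroups consume the same sorted lists.
        List<SimpleEntry<String, Integer>> sorted1 = sortByKey(Arrays.asList(pair("b", 2), pair("a", 1), pair("a", 5)));
        List<SimpleEntry<String, Integer>> sorted2 = sortByKey(Arrays.asList(pair("c", 3), pair("a", 10)));
        System.out.println(coGroup1(sorted1, sorted2)); // [a=16, b=2, c=3]
        System.out.println(coGroup2(sorted1, sorted2)); // [3, 1, 1]
    }
}
```

Compared with the single-CoGroup variant, each output here is produced directly with its own type, at the cost of reading the sorted inputs twice.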

Regarding the iterations, have a look at Flink's iteration operators.


answered Dec 03 '22 05:12

Fabian Hueske


Tags:

apache-flink

Ghislain Viguier
