The join and coGroup transformation can read 2 input datasets and output one ("Y" flux) (correct me if I'm wrong).
I would like to process and update 2 datasets. To do this, I plan to use 2 coGroup
transformations.
But, for performance purpose, can these both transformations be done in a single one ("H" flux)?
Also, as the datasets are updated, I would like to iterate over them. If it's not currently possible, do you plan to support this kind of transformation in the future?
It can apply different kinds of transformations on the datasets like filtering, mapping, aggregating, joining and grouping.
According to the Apache Flink documentation, KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.
This connector provides a Sink that writes partitioned files to filesystems supported by the Flink FileSystem abstraction. The streaming file sink writes incoming data into buckets. Given that the incoming streams can be unbounded, data in each bucket are organized into part files of finite size.
Using keyed streams - Flink TutorialFlink distributes the events in a data stream to different task slots based on the key. Flink users are hashing algorithms to divide the stream by partitions based on the number of slots allocated to the job. It then distributes the same keys to the same slots.
All Flink DataSet operators support only a single output, but the output of an operator can be consumed by two or more following operators.
There are two ways to solve your issue:
Tuple2<FirstType, SecondType>
. This solution would look like:input1--\ /--> Filter_output1 CoGroup input2--/ \--> Filter_output2
input1 --> PartitionHash --> SortPartition -\-/-> CoGroup1 --> Output1 X input2 --> PartitionHash --> SortPartition -/-\-> CoGroup2 --> Output2
Regarding the iterations, have a look at Flink's iteration operators.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With