Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Flink: What's the difference between side outputs and split() in the DataStream API?

Apache Flink has a split API that lets to branch data-streams:

val splited = datastream.split { i => i match {
   case i if ... => Seq("red", "blue")
   case _ => Seq("green")
}}

splited.select("green").flatMap { .... }

It also provides a another approach called Side Output( https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/side_output.html) that lets you do the same thing!

What's the difference between these two way? Do they use from a same lower-level construction? Do they cost the same? When and how we should select one of them?

like image 881
Reza Same'ei Avatar asked Jul 20 '18 10:07

Reza Same'ei


People also ask

How do I combine two streams in Flink?

Join in Action Now run the flink application and also tail the log to see the output. Enter messages in both of these two netcat windows within a window of 30 seconds to join both the streams. The resultant data stream has complete information of an individual-: the id, name, department, and salary.

What is DataStream API?

Datastream provides a REST API for administering your private connectivity configurations, connection profiles, and streams programmatically. The REST API is defined by resources associated with creating and managing private connectivity configurations, connection profiles, and streams.

How does Flink streaming work?

Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster. Therefore, an application can leverage virtually unlimited amounts of CPUs, main memory, disk and network IO.

How does KeyBy work Flink?

KeyBy is one of the mostly used transformation operator for data streams. It is used to partition the data stream based on certain properties or keys of incoming data objects in the stream. Once we apply the keyBy, all the data objects with same type of keys are grouped together.


2 Answers

One important difference between split and side outputs is that split is deprecated while side outputs are not.

Quote from Flink's split manual:

split(OutputSelector<T> outputSelector)
Deprecated. 
Please use side output instead.
like image 31
patryk.beza Avatar answered Sep 19 '22 18:09

patryk.beza


The split operator is part of the DataStream API since its early days. The side output feature as added later and offers a superset of split's functionality.

split creates multiple streams of the same type, the input type. Side outputs can be of any type, i.e., also different from the input and the main output.

Internally, split adds dedicated operator that just splits the stream. Side outputs are defined within an operator (typically a ProcessFunction or window operator) that apply arbitrary logic and feature multiple outputs. I would not expect this to result in a significant performance difference.

A common use case for side outputs is to filter out invalid (or late) records and pass them unmodified to the side, e.g., to process them later. Such an operator has a regular output with the desired result type and a side output with its input type. This logic would be cumbersome to implement using split.

like image 128
Fabian Hueske Avatar answered Sep 18 '22 18:09

Fabian Hueske