Apache Flink has a split
API that lets to branch data-streams:
val splited = datastream.split { i => i match {
case i if ... => Seq("red", "blue")
case _ => Seq("green")
}}
splited.select("green").flatMap { .... }
It also provides a another approach called Side Output( https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/side_output.html) that lets you do the same thing!
What's the difference between these two way? Do they use from a same lower-level construction? Do they cost the same? When and how we should select one of them?
Join in Action Now run the flink application and also tail the log to see the output. Enter messages in both of these two netcat windows within a window of 30 seconds to join both the streams. The resultant data stream has complete information of an individual-: the id, name, department, and salary.
Datastream provides a REST API for administering your private connectivity configurations, connection profiles, and streams programmatically. The REST API is defined by resources associated with creating and managing private connectivity configurations, connection profiles, and streams.
Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster. Therefore, an application can leverage virtually unlimited amounts of CPUs, main memory, disk and network IO.
KeyBy is one of the mostly used transformation operator for data streams. It is used to partition the data stream based on certain properties or keys of incoming data objects in the stream. Once we apply the keyBy, all the data objects with same type of keys are grouped together.
One important difference between split
and side outputs is that split
is deprecated while side outputs are not.
Quote from Flink's split
manual:
split(OutputSelector<T> outputSelector)
Deprecated.
Please use side output instead.
The split
operator is part of the DataStream API since its early days. The side output feature as added later and offers a superset of split
's functionality.
split
creates multiple streams of the same type, the input type. Side outputs can be of any type, i.e., also different from the input and the main output.
Internally, split
adds dedicated operator that just splits the stream. Side outputs are defined within an operator (typically a ProcessFunction
or window operator) that apply arbitrary logic and feature multiple outputs. I would not expect this to result in a significant performance difference.
A common use case for side outputs is to filter out invalid (or late) records and pass them unmodified to the side, e.g., to process them later. Such an operator has a regular output with the desired result type and a side output with its input type. This logic would be cumbersome to implement using split
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With