Apache Flink: What's the difference between side outputs and split() in the DataStream API?

Tags:

flink-streaming

Apache Flink has a split API that lets you branch data streams:

val splited = datastream.split { i => i match {
   case i if ... => Seq("red", "blue")
   case _ => Seq("green")
}}

splited.select("green").flatMap { .... }

It also provides another approach, called side outputs (https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/side_output.html), that lets you do the same thing!

What's the difference between these two approaches? Are they built on the same lower-level construct? Do they cost the same? When and how should we choose one over the other?


asked Jul 20 '18 10:07

2 Answers

One important difference between split and side outputs is that split is deprecated while side outputs are not.

Quote from Flink's split manual:

split(OutputSelector<T> outputSelector)
Deprecated. 
Please use side output instead.
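
As a rough sketch of what that migration might look like for the snippet in the question, assuming an input of type Int: the object name `SplitToSideOutputs`, the helper `route`, and the predicate `isRedAndBlue` (a stand-in for the condition elided in the question) are hypothetical names chosen for illustration.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object SplitToSideOutputs {
  // One tag per former split name; here every tag carries the input type.
  val red: OutputTag[Int]   = OutputTag[Int]("red")
  val blue: OutputTag[Int]  = OutputTag[Int]("blue")
  val green: OutputTag[Int] = OutputTag[Int]("green")

  // Hypothetical stand-in for the condition that the question elides.
  def isRedAndBlue(i: Int): Boolean = i % 2 == 0

  // Routes each record to the side outputs; nothing is sent to the main output.
  def route(input: DataStream[Int]): DataStream[Int] =
    input.process(new ProcessFunction[Int, Int] {
      override def processElement(
          value: Int,
          ctx: ProcessFunction[Int, Int]#Context,
          out: Collector[Int]): Unit = {
        if (isRedAndBlue(value)) {
          // split could name several outputs at once; emit to each tag instead
          ctx.output(red, value)
          ctx.output(blue, value)
        } else {
          ctx.output(green, value)
        }
      }
    })
}
```

With this in place, `SplitToSideOutputs.route(datastream).getSideOutput(SplitToSideOutputs.green)` would play the role of `splited.select("green")`.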

answered Sep 19 '22 18:09

patryk.beza

The split operator has been part of the DataStream API since its early days. The side output feature was added later and offers a superset of split's functionality.

split creates multiple streams of the same type, the input type. Side outputs can be of any type, i.e., also different from the input and the main output.
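
To illustrate the typing point, here is a minimal sketch in which the input, the main output, and the side output all have different types (Int, Long, and String). The names `TypedSideOutput`, `labels`, `label`, and `squares` are made up for this example.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object TypedSideOutput {
  // The side output type (String) differs from both input (Int) and main output (Long).
  val labels: OutputTag[String] = OutputTag[String]("labels")

  // Hypothetical helper that builds the side-output record.
  def label(i: Int): String = s"seen-$i"

  def squares(input: DataStream[Int]): DataStream[Long] =
    input.process(new ProcessFunction[Int, Long] {
      override def processElement(
          value: Int,
          ctx: ProcessFunction[Int, Long]#Context,
          out: Collector[Long]): Unit = {
        out.collect(value.toLong * value)   // main output: Long
        ctx.output(labels, label(value))    // side output: String
      }
    })
}
```

The String stream is then obtained with `getSideOutput(TypedSideOutput.labels)` on the result of `squares`. A single split call cannot express this, since every selected stream has the input type.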

Internally, split adds a dedicated operator that just splits the stream. Side outputs are defined within an operator (typically a ProcessFunction or window operator) that applies arbitrary logic and features multiple outputs. I would not expect this to result in a significant performance difference.

A common use case for side outputs is to filter out invalid (or late) records and pass them unmodified to the side, e.g., to process them later. Such an operator has a regular output with the desired result type and a side output with its input type. This logic would be cumbersome to implement using split.
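
A sketch of that pattern, under the assumption that records arrive as Strings and "valid" means "parses as an Int": the main output carries the parsed Int, while unparseable records go unmodified to a side output of the input type. `InvalidRecordsToSide`, `invalid`, `parse`, and `parseInts` are illustrative names, not part of Flink.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import scala.util.Try

object InvalidRecordsToSide {
  // Side output keeps the input type so rejected records pass through unchanged.
  val invalid: OutputTag[String] = OutputTag[String]("invalid")

  // Assumed validity rule for this example: the trimmed record parses as an Int.
  def parse(raw: String): Option[Int] = Try(raw.trim.toInt).toOption

  def parseInts(input: DataStream[String]): DataStream[Int] =
    input.process(new ProcessFunction[String, Int] {
      override def processElement(
          raw: String,
          ctx: ProcessFunction[String, Int]#Context,
          out: Collector[Int]): Unit =
        parse(raw) match {
          case Some(n) => out.collect(n)           // regular output, result type
          case None    => ctx.output(invalid, raw) // side output, original record
        }
    })
}
```

Calling `getSideOutput(InvalidRecordsToSide.invalid)` on the returned stream yields the rejected records for later processing. For late records in windowed streams, the window operator offers `sideOutputLateData` for the same purpose.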


answered Sep 18 '22 18:09

Fabian Hueske

Tags:

apache-flink

flink-streaming

Reza Same'ei
