I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other and there are times where one stream is not receiving events at all. Is it possible to use the last event from the one stream and join it with the other stream on every event that is coming?
The only solution I found is using the join function, but you have to specify a common window, where you can apply the join function. This is window is not reached, when one stream is not receiving any events.
Is there a possibility to apply the join function on every event that is coming from either one stream or the other and maintain state of the last consumed event and use this event for the join function?
Join in Action Now run the flink application and also tail the log to see the output. Enter messages in both of these two netcat windows within a window of 30 seconds to join both the streams. The resultant data stream has complete information of an individual-: the id, name, department, and salary.
Apache Flink's unified approach to stream and batch processing means that a DataStream application executed over bounded input will produce the same final results regardless of the configured execution mode.
Apache Flink is that real-time processing tool.
A window join adds the dimension of time into the join criteria themselves. In doing so, the window join joins the elements of two streams that share a common key and are in the same window. The semantic of window join is same to the DataStream window join.
There are many different approaches to combining or joining two streams in Flink, depending on requirements of each specific use case. When doing this "by hand",
you want to be using Flink's ConnectedStream
s with a RichCoFlatMapFunction
or CoProcessFunction
. Either of these will allow you to keep managed state (i.e. the last element from the infrequently updating stream), and join it with the faster stream. CoProcessFunction adds the ability to work with timers, which you should use to clear state for expired keys, if that's relevant.
There's an exercise on the Flink training site about different approaches for implementing such joins: Enrichment Joins. For a simpler example, see also the exercise about Expiring State.
Each recent release of Flink has included additional built-in join functions, so at this point it is less often necessary to roll your own. See the pages on joining with the DataStream API, joins with the Table API, and joins in SQL for more details.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With