Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combine two streams in Apache Flink regardless on window time

I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other and there are times where one stream is not receiving events at all. Is it possible to use the last event from the one stream and join it with the other stream on every event that is coming?

The only solution I found is using the join function, but you have to specify a common window, where you can apply the join function. This is window is not reached, when one stream is not receiving any events.

Is there a possibility to apply the join function on every event that is coming from either one stream or the other and maintain state of the last consumed event and use this event for the join function?

like image 278
FLoppix Avatar asked Sep 02 '17 14:09

FLoppix


People also ask

How do I combine two streams in Flink?

Join in Action Now run the flink application and also tail the log to see the output. Enter messages in both of these two netcat windows within a window of 30 seconds to join both the streams. The resultant data stream has complete information of an individual-: the id, name, department, and salary.

Can Flink do batch processing?

Apache Flink's unified approach to stream and batch processing means that a DataStream application executed over bounded input will produce the same final results regardless of the configured execution mode.

Is Flink real-time?

Apache Flink is that real-time processing tool.

What is a window join?

A window join adds the dimension of time into the join criteria themselves. In doing so, the window join joins the elements of two streams that share a common key and are in the same window. The semantic of window join is same to the DataStream window join.


1 Answers

There are many different approaches to combining or joining two streams in Flink, depending on requirements of each specific use case. When doing this "by hand", you want to be using Flink's ConnectedStreams with a RichCoFlatMapFunction or CoProcessFunction. Either of these will allow you to keep managed state (i.e. the last element from the infrequently updating stream), and join it with the faster stream. CoProcessFunction adds the ability to work with timers, which you should use to clear state for expired keys, if that's relevant.

There's an exercise on the Flink training site about different approaches for implementing such joins: Enrichment Joins. For a simpler example, see also the exercise about Expiring State.

Each recent release of Flink has included additional built-in join functions, so at this point it is less often necessary to roll your own. See the pages on joining with the DataStream API, joins with the Table API, and joins in SQL for more details.

like image 171
David Anderson Avatar answered Oct 12 '22 08:10

David Anderson