Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Enriching DataStream using static DataSet in Flink streaming

I am writing a Flink streaming program in which I need to enrich a DataStream of user events using some static data set (information base, IB).

For E.g. Let's say we have a static data set of buyers and we have an incoming clickstream of events, for each event we want to add a boolean flag indicating whether the doer of the event is a buyer or not.

An ideal way to achieve this would be to partition the incoming stream by user id, have the buyers set available in a DataSet partitioned again by user id and then do a look up for each event in the stream into this DataSet.

Since Flink does not allow using DataSets in a streaming program, how can I achieve the above ?

Another option could be to use Managed Operator State to store buyers set, but how can I keep this state distributed by user id so as to avoid network i/o in individual event look ups ? In case of memory state backend, does state remain distributed by some key, or is it replicated across all operator subtasks ?

What is the right design pattern to achieve the above enriching requirement in a Flink streaming program ?

like image 502
Vijay Kansal Avatar asked Apr 04 '18 07:04

Vijay Kansal


People also ask

What is dataset and datastream in Flink?

The two main data abstractions of Flink are DataStream and DataSet, they represent read-only collections of data elements. The list of elements is bounded (i.e., finite) in DataSet, while it is unbounded (i.e., infinite) in the case of DataStream.

How does Flink streaming work?

Flink Streaming uses the pipelined Flink engine to process data streams in real time and offers a new API including definition of flexible windows. In this post, we go through an example that uses the Flink Streaming API to compute statistics on stock market data that arrive continuously and combine...

How do I read data from external sources in Flink?

Flink, of course, has support for reading in streams from external sources such as Apache Kafka, Apache Flume, RabbitMQ, and others. For the sake of this example, the data streams are simply generated using the generateStock method: To read from the text socket stream please make sure that you have a socket running.

Can Flink compute rolling correlations between market data streams and Twitter mentions?

Moving towards more advanced features, we compute rolling correlations between the market data streams and a Twitter stream with stock mentions. For running the example implementation please use the 0.9-SNAPSHOT version of Flink as a dependency. The full example code base can be found here in Scala and here in Java7.


1 Answers

I would key the stream by user_id, and use a RichFlatMap to do the enrichment. In the open() method of the RichFlatMap you can load the static buyer flag for that user and keep it cached in a boolean field.

like image 155
David Anderson Avatar answered Sep 21 '22 15:09

David Anderson