I am writing a Flink streaming program in which I need to enrich a DataStream of user events using some static data set (information base, IB).
For E.g. Let's say we have a static data set of buyers and we have an incoming clickstream of events, for each event we want to add a boolean flag indicating whether the doer of the event is a buyer or not.
An ideal way to achieve this would be to partition the incoming stream by user id, have the buyers set available in a DataSet partitioned again by user id and then do a look up for each event in the stream into this DataSet.
Since Flink does not allow using DataSets in a streaming program, how can I achieve the above ?
Another option could be to use Managed Operator State to store buyers set, but how can I keep this state distributed by user id so as to avoid network i/o in individual event look ups ? In case of memory state backend, does state remain distributed by some key, or is it replicated across all operator subtasks ?
What is the right design pattern to achieve the above enriching requirement in a Flink streaming program ?
The two main data abstractions of Flink are DataStream and DataSet, they represent read-only collections of data elements. The list of elements is bounded (i.e., finite) in DataSet, while it is unbounded (i.e., infinite) in the case of DataStream.
Flink Streaming uses the pipelined Flink engine to process data streams in real time and offers a new API including definition of flexible windows. In this post, we go through an example that uses the Flink Streaming API to compute statistics on stock market data that arrive continuously and combine...
Flink, of course, has support for reading in streams from external sources such as Apache Kafka, Apache Flume, RabbitMQ, and others. For the sake of this example, the data streams are simply generated using the generateStock method: To read from the text socket stream please make sure that you have a socket running.
Moving towards more advanced features, we compute rolling correlations between the market data streams and a Twitter stream with stock mentions. For running the example implementation please use the 0.9-SNAPSHOT version of Flink as a dependency. The full example code base can be found here in Scala and here in Java7.
I would key the stream by user_id, and use a RichFlatMap to do the enrichment. In the open() method of the RichFlatMap you can load the static buyer flag for that user and keep it cached in a boolean field.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With