Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Flink: Enrich stream with data from external/blocking call

In my application, I want to enrich an infinite stream of events. The stream itself is parallelised by hashing of an Id. For every event, there might be a call to an external source (e.g. REST, DB). This call is blocking by nature. The order of events within one stream partition must be maintained.

My idea was to create a RichMapFunction, which sets up the connection and then polls the external source for each event. The blocking call usually takes not to long, but in the worst case, the service could be down.

Theoretically, this works, but I don't feel good doing it this way, as I don't know how Flink reacts if you have some blocking operations within the stream. And what happens if you have a lot of parallel streams blocking, i.e. am I running out of threads? Or how is the behavior stream-upwards at the point where the stream is parallelised?

Does someone else may have a similar issue and an answer to my question or some ideas how to tackle it?

like image 589
peterschrott Avatar asked Jun 16 '16 08:06

peterschrott


People also ask

Does Netflix use Flink?

Netflix engineers recently published how they built Studio Search, using Apache Kafka streams, an Apache Flink-based Data Mesh process, and an Elasticsearch sink to manage the index.

Why Flink is faster than spark?

The main reason for this is its stream processing feature, which manages to process rows upon rows of data in real time – which is not possible in Apache Spark's batch processing method. This makes Flink faster than Spark.

Is Flink A ETL?

Flink enables real-time data analytics on streaming data and fits well for continuous Extract-transform-load (ETL) pipelines on streaming data and for event-driven applications as well.

What is KeyBy in Flink?

According to the Apache Flink documentation, KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.


1 Answers

RichMapFunction is a good starting point but prefer RichAsyncFunctionwhich is asynchronous and which not block your processing !

Careful :
1- your DB access but also be asynchronous
2- your event order may change (according to the mode used)

More details : https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

Hope it helps

like image 195
Eric Taix Avatar answered Sep 27 '22 23:09

Eric Taix