Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performing an asynchronous transformation within a Kafka Stream

Assume I have two Kafka topics, A and B. I am trying to develop a system that pulls records from A, applies a transformation to each record, then publishes the transformed records to B. In this case, the transformation involves calling a REST endpoint over HTTP.

Being relatively new to Kafka, I was glad to see that the Kafka Streams project already solved this type of problem (consume-transform-publish). Unfortunately, I discovered that transformations in Kafka streams are blocking operations. Instinctively, I try to call HTTP endpoints in a non-blocking, asynchronous manner.

Does this mean that Kafka Streams will not work in this situation? Does this mean that I must revert back to calling the REST endpoint in a blocking manner? Is this even an acceptable pattern for Kafka Streams? Stream-based data processing is still relatively new to me, so I am not entirely familiar with its concurrency models.

like image 441
Adam Paynter Avatar asked Jun 11 '16 13:06

Adam Paynter


People also ask

Can Kafka be used for stream processing?

Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.

What is KStream and KTable in Kafka?

KStream, KTable and GlobalKTable. Kafka Streams provides two abstractions for Streams and Tables. KStream handles the stream of records. On the other hand, KTable manages the changelog stream with the latest state of a given key. Each data record represents an update.

What is difference between Kafka connect and Kafka Streams?

Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker. Kafka Connect is an API for moving data into and out of Kafka.


1 Answers

Update: after looking in to this further, I am not sure that this is the right answer...


I am new to Kafka and Kafka Streams (hereafter referred to as "Kafka"), but having encountered and considered similar questions, here is my perspective:

Kafka has two salient features:

  1. All parallelism is achieved through the partitioning of topics
  2. Within a partition of a topic, processing is strongly ordered, one-at-a-time.

Many really nice properties fall out from these features. For example, stream-based "transactions", I think, is one of the coolest.

But whether these properties are actually "features" in the sense that you want them, of course, depends on the application. If you don't want strongly ordered processing with parallelism based on topic partitioning, then you might not want to be using Kafka for that application.

So, with regard to:

Does this mean that Kafka Streams will not work in this situation?

It will work, but increased parallelism is achieved through increased partitioning.

Does this mean that I must revert back to calling the REST endpoint in a blocking manner?

Yes, I think it does—but I'm not sure why that would be a "reversion". Personally, that's what I like about Kafka: blocking code is simpler. If I want more parallelism, I can run more threads. There's no shared state, after all.

like image 86
Dmitry Minkovsky Avatar answered Oct 03 '22 05:10

Dmitry Minkovsky