Assume I have two Kafka topics, A and B. I am trying to develop a system that pulls records from A, applies a transformation to each record, then publishes the transformed records to B. In this case, the transformation involves calling a REST endpoint over HTTP.
Being relatively new to Kafka, I was glad to see that the Kafka Streams project already solved this type of problem (consume-transform-publish). Unfortunately, I discovered that transformations in Kafka streams are blocking operations. Instinctively, I try to call HTTP endpoints in a non-blocking, asynchronous manner.
Does this mean that Kafka Streams will not work in this situation? Does this mean that I must revert back to calling the REST endpoint in a blocking manner? Is this even an acceptable pattern for Kafka Streams? Stream-based data processing is still relatively new to me, so I am not entirely familiar with its concurrency models.
Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
KStream, KTable and GlobalKTable. Kafka Streams provides two abstractions for Streams and Tables. KStream handles the stream of records. On the other hand, KTable manages the changelog stream with the latest state of a given key. Each data record represents an update.
Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker. Kafka Connect is an API for moving data into and out of Kafka.
Update: after looking in to this further, I am not sure that this is the right answer...
I am new to Kafka and Kafka Streams (hereafter referred to as "Kafka"), but having encountered and considered similar questions, here is my perspective:
Kafka has two salient features:
Many really nice properties fall out from these features. For example, stream-based "transactions", I think, is one of the coolest.
But whether these properties are actually "features" in the sense that you want them, of course, depends on the application. If you don't want strongly ordered processing with parallelism based on topic partitioning, then you might not want to be using Kafka for that application.
So, with regard to:
Does this mean that Kafka Streams will not work in this situation?
It will work, but increased parallelism is achieved through increased partitioning.
Does this mean that I must revert back to calling the REST endpoint in a blocking manner?
Yes, I think it does—but I'm not sure why that would be a "reversion". Personally, that's what I like about Kafka: blocking code is simpler. If I want more parallelism, I can run more threads. There's no shared state, after all.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With