The naive approach to implementing this use case - enriching an incoming stream of events stored in Kafka with reference data - is to call, from within the map() operator, an external service's REST API that provides the reference data, once per incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have a second stream of events carrying the reference data, store it in a KTable - which acts as a lightweight embedded "database" - and then join the main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");
eventStream
    .leftJoin(referenceDataTable, (event, referenceData) -> /* return the enriched event */)
    .map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent))
    .to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable
" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. A service that is called from the map() operator should be capable of handling that load too, and should also be highly available. These are extra requirements for the service implementation. But if the service satisfies these criteria, can the "naive" approach be used?
Yes, it is ok to do RPC inside Kafka Streams operations such as map(). You just need to be aware of the pros and cons of doing so; see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into the details of why here; if needed, I'd suggest creating a new question).
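To make the "synchronously" part concrete, here is a minimal sketch of a blocking lookup call from within mapValues(), assuming Java 11+'s HttpClient, String-valued events, and a hypothetical reference-service URL:
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.streams.kstream.KStream;

HttpClient httpClient = HttpClient.newHttpClient();

KStream<String, String> enriched = eventStream.mapValues(event -> {
    // Hypothetical lookup endpoint; send() blocks until the response arrives,
    // so the record is fully processed before the next one is taken.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://reference-service/lookup?event="
            + URLEncoder.encode(event, StandardCharsets.UTF_8)))
        .build();
    try {
        HttpResponse<String> response =
            httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        return event + " | " + response.body();
    } catch (Exception e) {
        // Error handling is application-specific; rethrowing fails the stream task.
        throw new RuntimeException("Reference lookup failed for event " + event, e);
    }
});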
Pros of doing RPC calls from within Kafka Streams operations:
Cons:
- The RPC call (and thus talking to a remote system from within map()) is a side effect and therefore a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
- In case of failures, input records may be re-processed, which means any RPC calls issued from map() may be executed more than once. You must therefore ensure that these calls will be idempotent (as sketched below). Ensuring the latter is your responsibility.
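One possible way to keep such a side-effecting call idempotent is to derive a deterministic key from the record, so a re-processed record repeats the same request instead of producing a new side effect. This is only a sketch (reusing the HttpClient setup from above); the "Idempotency-Key" header and the remote service honoring it are assumptions:
eventStream.mapValues((key, event) -> {
    // Deterministic idempotency key derived from the record key (an assumed convention);
    // the remote service must de-duplicate requests carrying the same key.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://reference-service/enrich"))
        .header("Idempotency-Key", key)
        .POST(HttpRequest.BodyPublishers.ofString(event))
        .build();
    // ... send the request synchronously as in the sketch above and return the enriched event
    return event;
});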
Alternatives
In case you are wondering what other alternatives you have: if, for example, you are doing RPC calls for looking up data (e.g. for enriching an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data is in MySQL, you can set up a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and perform the enrichment of your input stream via a stream-table join.