The naive approach to implementing this use case - enriching an incoming stream of events stored in Kafka with reference data - is to call, from within the map() operator, an external service's REST API that provides the reference data, once per incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have a second stream of events carrying the reference data, store it in a KTable - which acts as a lightweight embedded "database" - and then join the main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");
eventStream
    .leftJoin(referenceDataTable, (event, referenceData) -> /* return the enriched event */)
    .map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent))
    .to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable
" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. A service that is called from the map() operator should be capable of handling that load too, and should also be highly available. These are extra requirements for the service implementation. But if the service satisfies these criteria, can the "naive" approach be used?
Yes, it is ok to do RPC inside Kafka Streams operations such as map(). You just need to be aware of the pros and cons of doing so; see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into the details of why here; if needed, I'd suggest creating a new question).
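To make the "synchronously" part concrete, here is a minimal sketch of a blocking lookup call from within mapValues(), assuming Java 11+'s HttpClient, String-valued events, and a hypothetical reference-service URL:
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.streams.kstream.KStream;

HttpClient httpClient = HttpClient.newHttpClient();

KStream<String, String> enriched = eventStream.mapValues(event -> {
    // Hypothetical lookup endpoint; send() blocks until the response arrives,
    // so the record is fully processed before the next one is taken.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://reference-service/lookup?event="
            + URLEncoder.encode(event, StandardCharsets.UTF_8)))
        .build();
    try {
        HttpResponse<String> response =
            httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        return event + " | " + response.body();
    } catch (Exception e) {
        // Error handling is application-specific; rethrowing fails the stream task.
        throw new RuntimeException("Reference lookup failed for event " + event, e);
    }
});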
Pros of doing RPC calls from within Kafka Streams operations:
Cons:
- The RPC call (and thus talking to a remote system from within map()) is a side effect and therefore a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
- In case of failures, input records may be re-processed, which means any RPC calls issued from map() may be executed more than once. You must therefore ensure that these calls will be idempotent (as sketched below). Ensuring the latter is your responsibility.
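One possible way to keep such a side-effecting call idempotent is to derive a deterministic key from the record, so a re-processed record repeats the same request instead of producing a new side effect. This is only a sketch (reusing the HttpClient setup from above); the "Idempotency-Key" header and the remote service honoring it are assumptions:
eventStream.mapValues((key, event) -> {
    // Deterministic idempotency key derived from the record key (an assumed convention);
    // the remote service must de-duplicate requests carrying the same key.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://reference-service/enrich"))
        .header("Idempotency-Key", key)
        .POST(HttpRequest.BodyPublishers.ofString(event))
        .build();
    // ... send the request synchronously as in the sketch above and return the enriched event
    return event;
});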
Alternatives
In case you are wondering what other alternatives you have: if, for example, you are doing RPC calls for looking up data (e.g. for enriching an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data is in MySQL, you can set up a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and perform the enrichment of your input stream via a stream-table join.