 

Kafka Stream: output to a topic first or persist directly?

A fair number of articles describe Kafka Streams implementations that output results to a new Kafka topic instead of saving them to some sort of distributed database.

Is this just a common use case, on the assumption that the embedded state store plus interactive queries is sufficient, or is there some architectural reason to output to a topic and then consume it again to persist, instead of persisting directly?

I'm not sure if it makes a difference, but the context of the examples I'm looking at is for tumbling time-windowed aggregation.
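For concreteness, here's a rough sketch of the kind of topology I have in mind. The topic names, the 5-minute window size, and the use of the newer `TimeWindows.ofSizeWithNoGrace` API (Kafka 3.0+) are my own assumptions, not taken from any particular article:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TumblingWindowExample {
    static void buildTopology(StreamsBuilder builder) {
        builder.<String, String>stream("events")          // default serdes assumed via config
            .groupByKey()
            // 5-minute tumbling windows with no grace period.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .count()
            .toStream()
            // Flatten the windowed key into a plain String key for the output topic.
            .map((windowedKey, count) -> KeyValue.pair(
                windowedKey.key() + "@" + windowedKey.window().start(),
                String.valueOf(count)))
            // This is the step in question: write to a topic rather than a db.
            .to("event-counts", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```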

asked Jun 18 '17 by Asmodean

People also ask

Are Kafka Streams ordered?

Kafka Streams guarantees ordering by offset, not by timestamp. Thus, by default, the "last update wins" policy is based on offsets rather than timestamps. Late-arriving records ("late" as defined by timestamp) are out of order with respect to timestamps, and they will not be reordered; the original offset order is preserved.

Are Kafka Streams persistent?

Kafka Streams applications are backed by a persistent or in-memory state store, which is in turn backed by a Kafka changelog topic, providing full fault tolerance.
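As an illustration, here is a minimal sketch (store and topic names are hypothetical) of an aggregation materialized in an explicitly named store; by default this is a persistent RocksDB store, and Kafka Streams backs it with a changelog topic for recovery:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class StateStoreExample {
    static void buildTopology(StreamsBuilder builder) {
        // Count per key, materialized in a named state store. Kafka Streams
        // creates a changelog topic ("<application.id>-counts-store-changelog")
        // behind it, so the store can be rebuilt after a failure.
        builder.<String, String>stream("events")
            .groupByKey()
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store")
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.Long()));
    }
}
```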

What is upstream and downstream in Kafka?

Key Concepts of Kafka Streams: a Streams processor is a node in a Streams topology. It receives records from a topic or from its upstream processor, produces one or more records, and writes them to a Kafka topic or to a downstream processor.
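A minimal Processor API sketch of this upstream/downstream wiring might look like the following (node and topic names are made up; it uses the newer `processor.api` package from Kafka 2.8+):

```java
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class UpDownstreamExample {
    public static Topology build() {
        Topology topology = new Topology();
        // Source node: reads from a topic (the processor's upstream).
        topology.addSource("source", "input-topic");
        // Processor node: receives records from "source" and forwards downstream.
        topology.addProcessor("uppercase",
            () -> new Processor<String, String, String, String>() {
                private ProcessorContext<String, String> context;

                @Override
                public void init(ProcessorContext<String, String> context) {
                    this.context = context;
                }

                @Override
                public void process(Record<String, String> record) {
                    // Produce one (or more) records to the downstream node(s).
                    context.forward(record.withValue(record.value().toUpperCase()));
                }
            },
            "source");
        // Sink node: writes the processor's output to a topic (its downstream).
        topology.addSink("sink", "output-topic", "uppercase");
        return topology;
    }
}
```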

How do I stream to Kafka topic?

Kafka Streams DSL: You can use KStream#to() to materialize a KStream to a topic. This is the canonical way to materialize data to a topic. Alternatively, you can use KStream#through(), which also materializes data to a topic but additionally returns the resulting KStream for further use.
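A short sketch of both calls (topic names are placeholders; note that through() was deprecated in Kafka 2.6 and removed in 3.0 in favor of KStream#repartition()):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class ToVsThrough {
    static void buildTopology(StreamsBuilder builder) {
        KStream<String, String> stream = builder.stream("input-topic");

        // KStream#to() is terminal: it writes to the topic and returns nothing.
        stream.to("output-topic");

        // KStream#through() writes to a topic AND returns a new KStream that
        // reads back from it, so processing can continue.
        // (Deprecated since 2.6, removed in 3.0; use repartition() instead.)
        KStream<String, String> continued = stream.through("intermediate-topic");
        continued.to("final-topic");
    }
}
```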


1 Answer

If all you want is to take data out of Kafka and store it in a database, then Kafka Connect is the most natural way to go.

On the other hand, if your primary use case is aggregation, then Kafka Streams is indeed an easy and elegant way to go about it. If a Kafka Connect sink already exists for your preferred database, the most straightforward approach is to have Kafka Streams write its output to a topic and let that Kafka Connect sink pick it up and store it in your database. If no out-of-the-box sink exists, you would have to write one yourself, and if you don't think it would be reusable enough, you might instead write the persistence logic as a custom Kafka Streams processor and skip the output topic entirely (see the sketch below).
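For illustration, here is a rough sketch of that second option: a terminal operator persisting each record directly to a database, with no output topic. The JDBC URL, table, and upsert statement are placeholders, and a real implementation would pool connections and batch writes rather than opening a connection per record:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.kafka.streams.StreamsBuilder;

public class DirectPersistExample {
    static void buildTopology(StreamsBuilder builder) {
        builder.<String, Long>stream("event-counts")
            // Terminal operation: write straight to the db, no output topic.
            .foreach((key, count) -> {
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/mydb");
                     PreparedStatement stmt = conn.prepareStatement(
                         "INSERT INTO counts (k, c) VALUES (?, ?) "
                             + "ON CONFLICT (k) DO UPDATE SET c = EXCLUDED.c")) {
                    stmt.setString(1, key);
                    stmt.setLong(2, count);
                    stmt.executeUpdate();
                } catch (Exception e) {
                    // Rethrow so the Streams thread surfaces the failure.
                    throw new RuntimeException(e);
                }
            });
    }
}
```

The trade-off is that you now own the delivery semantics, retries, and error handling that a Kafka Connect sink would otherwise provide.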

As you can see, there are various ways to go depending on your use case and your preferences. There is no one correct way, so please consider the trade-offs involved.

answered Nov 15 '22 by Michal Borowiecki