A fair number of articles describe Kafka Streams applications that write their results to a new Kafka topic instead of saving them to some sort of distributed database.
Is this just a common use case that assumes the embedded state store plus interactive queries is sufficient, or is there some architectural reason to output to a topic and then consume it again to persist it, rather than persisting directly?
I'm not sure if it makes a difference, but the examples I'm looking at all involve tumbling time-windowed aggregation.
Kafka Streams guarantees ordering by offset, but not by timestamp. Thus, the default "last update wins" policy is based on offsets, not timestamps. Late-arriving records ("late" being defined by timestamp) are out of order with respect to timestamps, and they will not be reordered; the original offset order is preserved.
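For context, here is a minimal sketch of such a tumbling time-windowed aggregation, assuming a recent Kafka Streams version (3.x); the topic names, store name, and one-minute window size are made up for illustration:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TumblingWindowExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tumbling-window-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("events")
               .groupByKey()
               // Windows are assigned by record timestamp, so a late record
               // still lands in its original window; updates to that window
               // are applied in offset order ("last update wins" by offset).
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count(Materialized.as("event-counts-store"))
               // Flatten the windowed key to a string so the result can be
               // written to an output topic with plain serdes.
               .toStream((windowedKey, count) ->
                       windowedKey.key() + "@" + windowedKey.window().start())
               .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```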
Kafka Streams state is kept in persistent or in-memory state stores, which are themselves backed by Kafka changelog topics, providing full fault tolerance.
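To illustrate the interactive-queries alternative raised in the question, here is a sketch that reads the "event-counts-store" materialized in the previous example directly from the running KafkaStreams instance; the key and time range are made up:

```java
import java.time.Duration;
import java.time.Instant;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class InteractiveQueryExample {
    // "streams" is the running KafkaStreams instance from the previous sketch.
    static void printRecentCounts(KafkaStreams streams) {
        ReadOnlyWindowStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "event-counts-store", QueryableStoreTypes.windowStore()));

        // Fetch the windows for one key over the last hour. The store is
        // served locally and restored from its changelog topic on failure,
        // so no output topic or external database is involved.
        try (WindowStoreIterator<Long> iter = store.fetch(
                "some-key", Instant.now().minus(Duration.ofHours(1)), Instant.now())) {
            iter.forEachRemaining(kv -> System.out.println(
                    "window start " + kv.key + " -> count " + kv.value));
        }
    }
}
```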
Key concepts of Kafka Streams: a stream processor is a node in a Streams topology. It receives a record from a topic or from its upstream processor, produces one or more records, and writes them to a Kafka topic or to a downstream processor.
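Expressed in the Processor API (the newer variant, Kafka 2.7+), a processor node looks roughly like this; the class name and the uppercase transform are made up for illustration:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// One node in a Streams topology: it receives a record from its upstream
// processor (or a source topic), produces a record, and forwards it to
// the downstream processor (or a sink topic).
public class UppercaseProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        context.forward(record.withValue(record.value().toUpperCase()));
    }
}
```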
In the Kafka Streams DSL, you can use KStream#to() to materialize the KStream to a topic; this is the canonical way to write data to a topic. Alternatively, you can use KStream#through(), which also materializes data to a topic but additionally returns the resulting KStream for further use.
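A short sketch of both operations, with placeholder topic names; note that through() has since been deprecated in favour of KStream#repartition() in newer Kafka versions, so this compiles only against versions that still ship it:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class MaterializeExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> stream = builder.stream("input-topic");

        // to(): terminal; materializes the stream to a topic and ends the chain.
        stream.to("output-topic");

        // through(): materializes to a topic AND returns a KStream reading
        // from that topic, so processing can continue downstream.
        KStream<String, String> continued = stream.through("intermediate-topic");
        continued.to("final-topic");
    }
}
```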
If all you want is to take data out of Kafka and store it in a database, then Kafka Connect is the most natural way to go.
On the other hand, if your primary use case is aggregation, then Kafka Streams is indeed an easy and elegant way to go about it. If a Kafka Connect sink already exists for your preferred database, the most straightforward setup is to have Kafka Streams write its output to a topic and let that Kafka Connect sink pick it up and store it in your database. If no out-of-the-box sink exists, you would have to write one yourself, and if you don't think it would be reusable enough, you might instead write the persistence logic as a custom Kafka Streams processor and skip the output topic altogether (see the sketch below).
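For that custom-processor option, a rough sketch might look like the following; DbClient and its upsert() method are hypothetical stand-ins for whatever database driver you actually use:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.Record;

// Terminal processor that persists each record directly, with no output
// topic. DbClient is hypothetical; substitute your real database client.
public class DbSinkProcessor implements Processor<String, Long, Void, Void> {
    private final DbClient db = new DbClient(); // hypothetical client

    @Override
    public void process(Record<String, Long> record) {
        db.upsert(record.key(), record.value()); // hypothetical upsert call
    }

    @Override
    public void close() {
        db.close();
    }
}
```

You would attach this at the end of your topology, e.g. via KStream#process(DbSinkProcessor::new), trading the reusability of a Connect sink for a simpler deployment.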
As you can see, there are various ways to go depending on your use case and your preferences. There is no single correct way, so please consider the trade-offs involved.