I am new to Kafka Streams and I have been reading the documentation on how to set up a Kafka Streams application.
What is not clear to me, though, is how the data is handled: what is stored in memory and what is stored on disk. I have seen RocksDB mentioned somewhere, but not in the Streams documentation.
The problem I am trying to solve is as follows. I have two Kafka topics, both of a key-value store type, that keep the oldest value for each key. In my Streams application I want to join the two topics and write the join result back to Kafka, so that it can later be consumed by some sink. What worries me is that it is not clear how the joins are performed. Both topics will hold GBs of data, so there is no chance all of it will fit into the Streams application's memory.
You can read each topic as a KTable and do a table-table join:
// assuming String keys and values; adjust the generics and Serdes to your data
KTable<String, String> table1 = builder.table("topic-1");
KTable<String, String> table2 = builder.table("topic-2");
// the ValueJoiner is called for each key that is present in both tables
KTable<String, String> joinResult = table1.join(table2, (v1, v2) -> v1 + v2);
// KTable no longer has to() in newer versions; convert to a stream first
joinResult.toStream().to("output-topic");
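For completeness, here is a minimal, self-contained sketch of how such a topology could be wired up and started using the newer StreamsBuilder API (Kafka 1.0+). The topic names, the String serdes, and the concatenating joiner are placeholder assumptions; adapt them to your data:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class TableJoinApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "table-join-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // reading a topic as a KTable materializes it in a local state store
        KTable<String, String> table1 = builder.table("topic-1");
        KTable<String, String> table2 = builder.table("topic-2");
        // inner join: emits a record for each key present in both tables
        KTable<String, String> joinResult = table1.join(table2, (v1, v2) -> v1 + v2);
        joinResult.toStream().to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}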
For more details, see the docs: http://docs.confluent.io/current/streams/developer-guide.html#ktable-ktable-join
Also check out the examples: https://github.com/confluentinc/examples/tree/3.3.0-post/kafka-streams
At runtime, both topics will be materialized in RocksDB state stores. RocksDB keeps only a portion of the data in memory and spills the rest to disk, so the state does not need to fit into the application's heap. Also note that a single RocksDB instance only needs to hold the data of a single input partition. Compare http://docs.confluent.io/current/streams/architecture.html#parallelism-model
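The directory that backs those RocksDB stores is configurable, so you can point it at a disk with enough space for your GBs of state. A small example of the relevant settings (the path and thread count are just illustrative values):

// local state location for the RocksDB stores (defaults to /tmp/kafka-streams)
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
// input partitions are distributed over all threads/instances, so each local
// RocksDB store only holds the data of the partitions assigned to it
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);

Running multiple instances of the application divides the partitions (and thus the state) across them, which is how Kafka Streams scales beyond a single machine's disk and memory.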