Kafka Streams - what is stored in memory and disk in Streams App

I am new to Kafka Streams and I've been reading the documentation on how to set up a Kafka Streams application.

It is not clear to me, though, how the data is handled - what is stored in memory and what is stored on disk. I have seen RocksDB mentioned somewhere, but not in the Streams documentation.

The problem I am trying to solve is as follows. I have 2 Kafka topics, both of a key-value type that keep the oldest value for each key. In my Streams application I want to join both topics and write the join result back to Kafka, so it can later be consumed by some sink. What I am worried about is that it is not clear how joins are performed. Both topics will have GBs of data, so there is no chance all of it fits into the Streams app's memory.

asked Aug 23 '17 by eddyP23
1 Answer

You can read each topic as a KTable and do a table-table join:

KTable<String, String> table1 = builder.table("topic-1");
KTable<String, String> table2 = builder.table("topic-2");

// the ValueJoiner defines how the two values for a matching key are combined
KTable<String, String> joinResult = table1.join(table2, (v1, v2) -> v1 + v2);
joinResult.toStream().to("output-topic");

For more details see: http://docs.confluent.io/current/streams/developer-guide.html#ktable-ktable-join. Also check out the examples: https://github.com/confluentinc/examples/tree/3.3.0-post/kafka-streams

At runtime, both topics will be materialized in a RocksDB state store. RocksDB is able to spill to disk, so the state does not need to fit into memory. Also note that a single RocksDB instance only needs to hold the data of a single input partition. Compare the parallelism model: http://docs.confluent.io/current/streams/architecture.html#parallelism-model
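As a minimal sketch of how you might configure this: the property names below are standard Kafka Streams config keys, but the application id, broker address, path, and cache size are illustrative assumptions. The `state.dir` setting controls where RocksDB persists its on-disk state:

```java
import java.util.Properties;

public class StreamsConfigSketch {

    public static Properties buildConfig() {
        Properties props = new Properties();
        // Required: identifies the app and names its internal state/changelog topics
        props.put("application.id", "table-join-app");      // hypothetical app id
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        // Local directory where RocksDB state stores spill to disk
        props.put("state.dir", "/var/lib/kafka-streams");   // illustrative path
        // In-memory record cache used before writes hit RocksDB and the changelog
        props.put("cache.max.bytes.buffering", "10485760"); // 10 MB, illustrative
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        System.out.println("state.dir = " + props.getProperty("state.dir"));
    }
}
```

With these properties you would construct the `KafkaStreams` instance as usual; the join state lives under `state.dir`, and each of the app's stream threads manages RocksDB instances only for the partitions it is assigned.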

answered Nov 11 '22 by Matthias J. Sax