I've read the stateful stream processing overview and if I understand correctly, one of the main reasons why the RocksDB is being used as a default implementation of the key value store is a fact, that unlike in-memory collections, it can handle data larger than the available memory, because it can flush to disk. Both types of stores can survive application restarts, because the data is backed up as a Kafka topic.
But are there other differences? For example, I've noticed that my persistent state store creates some .log files for each topic partition, but they're all empty.
In short, I'm wondering what are the performance benefits and possible risks of replacing persistent stores with in-memory ones.
I don't see any real reason to swap current RocksDB store. In fact RocksDB one of the fastest k,v store: Percona benchmarks (based on RocksDB)
with in-memory ones
- RocksDB already acts as in-memory with some LRU
algorithms involved:
RocksDB architecture
The three basic constructs of RocksDB are memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile.
But there is one more noticeable reason for choosing this implementation:
RocksDB source code
If you will look at source code ratio - there are a lot of Java
api exposed from C++
code. So, it's much simpler to integrate this product in existing Java - based
Kafka ecosystem with comprehensive control over store, using exposed api.
I've got a very limited understanding of the internals of Kafka Streams and the different use cases of state stores, esp. in-memory vs persistent, but what I managed to learn so far is that a persistent state store is one that is stored on disk (and hence the name persistent) for a StreamTask
.
That does not give much as the names themselves in-memory vs persistent may have given the same understanding, but something that I found quite refreshing was when I learnt that Kafka Streams tries to assign partitions to the same Kafka Streams instances that had the partitions assigned before (a restart or a crash).
That said, an in-memory state store is simply recreated (replayed) every restart which takes time before a Kafka Streams application is up and running while a persistent state store is something already materialized on a disk and the only time the Kafka Streams instance has to do to re-create the state store is to load the files from disk (not from the changelog topic that takes longer).
I hope that helps and I'd be very glad to be corrected if I'm wrong (or partially correct).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With