Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In-memory vs persistent state stores in Kafka Streams?

I've read the stateful stream processing overview and if I understand correctly, one of the main reasons why the RocksDB is being used as a default implementation of the key value store is a fact, that unlike in-memory collections, it can handle data larger than the available memory, because it can flush to disk. Both types of stores can survive application restarts, because the data is backed up as a Kafka topic.

But are there other differences? For example, I've noticed that my persistent state store creates some .log files for each topic partition, but they're all empty.

In short, I'm wondering what are the performance benefits and possible risks of replacing persistent stores with in-memory ones.

like image 953
Dth Avatar asked Mar 08 '18 19:03

Dth


2 Answers

I don't see any real reason to swap current RocksDB store. In fact RocksDB one of the fastest k,v store: Percona benchmarks (based on RocksDB)

with in-memory ones - RocksDB already acts as in-memory with some LRU algorithms involved:

RocksDB architecture

The three basic constructs of RocksDB are memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile.

But there is one more noticeable reason for choosing this implementation:

RocksDB source code

If you will look at source code ratio - there are a lot of Java api exposed from C++ code. So, it's much simpler to integrate this product in existing Java - based Kafka ecosystem with comprehensive control over store, using exposed api.

like image 25
uptoyou Avatar answered Sep 23 '22 06:09

uptoyou


I've got a very limited understanding of the internals of Kafka Streams and the different use cases of state stores, esp. in-memory vs persistent, but what I managed to learn so far is that a persistent state store is one that is stored on disk (and hence the name persistent) for a StreamTask.

That does not give much as the names themselves in-memory vs persistent may have given the same understanding, but something that I found quite refreshing was when I learnt that Kafka Streams tries to assign partitions to the same Kafka Streams instances that had the partitions assigned before (a restart or a crash).

That said, an in-memory state store is simply recreated (replayed) every restart which takes time before a Kafka Streams application is up and running while a persistent state store is something already materialized on a disk and the only time the Kafka Streams instance has to do to re-create the state store is to load the files from disk (not from the changelog topic that takes longer).

I hope that helps and I'd be very glad to be corrected if I'm wrong (or partially correct).

like image 139
Jacek Laskowski Avatar answered Sep 22 '22 06:09

Jacek Laskowski