I am trying to understand RocksDB behavior in Kafka streams processor API. I am configuring a persistent StateStore using the default RocksDB that KStreams provide.
StoreBuilder countStoreBuilder =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("Counts"),
Serdes.String(),
Serdes.Long())
I am not doing any aggregation, join, or windowing. I am just receiving records and comparing some of them to previous items in the store and storing some of the records I receive in the state store.
The developer guide mentions that you can enable record caches in the Processor API by calling .withCachingEnabled()
on the above builder.
The cache "serves as a read cache to speed up reading data from a state store" - Record Caches Kafka Streams
However, my understanding is that RocksDB in persistent mode is first buffered in memory and will expand into disk only if the state doesn't fit in RAM.
RocksDB is just used as an internal lookup table (that is able to flush to disk if the state does not fit into memory RocksDB flushing is only required because state could be larger than available main-memory. Kafka Streams Internal Data Management
So how does record caches speed up the read from the state store if both are buffered in memory? It seems to me that record caches overlap with RocksDB behavior.
Kafka Streams uses RocksDB as the default storage engine for persistent stores. To change the default configuration for RocksDB, implement RocksDBConfigSetter and provide your custom class via rocksdb. config.
With default settings caching is enabled within Kafka Streams but RocksDB caching is disabled. Thus, to avoid high write traffic it is recommended to enable RocksDB caching if Kafka Streams caching is turned off. For example, the RocksDB Block Cache could be set to 100MB and Write Buffer size to 32 MB.
KCache is a client library that provides an in-memory cache backed by a compacted topic in Kafka. It is one of the patterns for using Kafka as a persistent store, as described by Jay Kreps in the article It's Okay to Store Data in Apache Kafka.
Memory. Kafka relies heavily on the filesystem for storing and caching messages. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.
Your observation is correct and it depends on the use case if caching is desired on not. One big advantage of application level caching (instead of RocksDB caching) is that it reduces the number of records written into the changelog topic that is used to make the store fault-tolerant. Hence, it reduced the load on the Kafka cluster and also may reduce recovery time.
For DSL users, caching also has an impact on downstream load (something you might not be interested for you application, as it seems you are using the Processor API):
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With