
Why does Apache Kafka Streams use RocksDB, and how is it possible to change it?

While investigating the new features in Apache Kafka 0.9 and 0.10, we used KStreams and KTables. Interestingly, Kafka Streams uses RocksDB internally; see "Introducing Kafka Streams: Stream Processing Made Simple". RocksDB is not written in a JVM-compatible language, so deployment needs careful handling, because it requires an extra, OS-dependent shared library.

This raises two simple questions:

  • Why does Apache Kafka Streams use RocksDB?
  • How is it possible to change it?

I tried to search for the answer, but I found only an implicit reason: RocksDB is very fast, handling on the order of millions of operations per second.

On the other hand, there are databases written in Java that might perform just as well end to end, since they do not have to go through JNI.

Seweryn Habdank-Wojewódzki asked Oct 18 '16 14:10



1 Answer

RocksDB is used for several (internal) reasons, for example its performance, as you already mentioned. Conceptually, Kafka Streams does not need RocksDB: it is used as an internal key-value cache, and any other store offering similar functionality would work, too.

Comment from @miguno below (rephrased):

One important advantage of RocksDB in contrast to pure in-memory key-value stores is its ability to write to disk. Thus, Kafka Streams can support a state that is larger than the available main memory.

Comment from @miguno above:

FYI: "RocksDB is not written in JVM compatible language, so it needs careful handling of the deployment, as it needs extra shared library (OS dependent)." As a user of Kafka Streams you don't need to install anything.

Using the Kafka Streams DSL, as of the 0.10.2 release (KAFKA-3825), it is possible to plug in custom state stores and to use a different key-value store.
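As a rough illustration of swapping the store in the DSL, here is a minimal sketch. It assumes a later Kafka Streams release (1.0 or newer) with the `kafka-streams` library on the classpath, where `Materialized` and `Stores.inMemoryKeyValueStore` are available; the 0.10.2 API looked different. The topic name `words`, the store name `word-counts`, and the class name are made up for the example.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

// Hypothetical example class; assumes Kafka Streams 1.0+ on the classpath.
public class InMemoryCountTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // "words" is an assumed input topic with String keys and values.
        builder.stream("words", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               // Back the count with an in-memory store instead of the
               // default RocksDB store. "word-counts" is an assumed name.
               .count(Materialized.<String, Long>as(
                              Stores.inMemoryKeyValueStore("word-counts"))
                      .withKeySerde(Serdes.String())
                      .withValueSerde(Serdes.Long()));

        return builder.build();
    }

    public static void main(String[] args) {
        // Print the topology description to see which stores are attached.
        System.out.println(build().describe());
    }
}
```

Note that an in-memory store gives up the disk-spilling advantage mentioned above, so the state has to fit in main memory.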

Using the Kafka Streams Processor API, you can implement your own store via the StateStore interface and connect it to a processor node in your topology.
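To make the "any store with similar functionality would work" idea concrete, here is a plain-Java sketch with no Kafka dependency. `SketchStore`, `InMemoryStore`, and `WordCounter` are hypothetical names invented for this example; `SketchStore` is only a simplified stand-in and is not Kafka's actual `StateStore` interface, which has more methods (initialization, flushing, and so on).

```java
import java.util.HashMap;
import java.util.Map;

public class PluggableStoreSketch {

    /** Hypothetical, simplified store contract; NOT Kafka's StateStore. */
    interface SketchStore<K, V> {
        String name();
        V get(K key);
        void put(K key, V value);
        void close();
    }

    /** One possible backend: a pure-Java map, no JNI involved. */
    static final class InMemoryStore<K, V> implements SketchStore<K, V> {
        private final String name;
        private final Map<K, V> data = new HashMap<>();

        InMemoryStore(String name) { this.name = name; }

        public String name() { return name; }
        public V get(K key) { return data.get(key); }
        public void put(K key, V value) { data.put(key, value); }
        public void close() { data.clear(); }
    }

    /** Stand-in for a processor node: it only sees the store contract. */
    static final class WordCounter {
        private final SketchStore<String, Long> store;

        WordCounter(SketchStore<String, Long> store) { this.store = store; }

        void process(String word) {
            Long current = store.get(word);
            store.put(word, current == null ? 1L : current + 1L);
        }
    }

    public static void main(String[] args) {
        SketchStore<String, Long> store = new InMemoryStore<>("word-counts");
        WordCounter counter = new WordCounter(store);
        for (String w : new String[] {"kafka", "rocksdb", "kafka"}) {
            counter.process(w);
        }
        System.out.println(store.get("kafka")); // prints 2
        store.close();
    }
}
```

Because the counter depends only on the store contract, a disk-backed implementation could replace `InMemoryStore` without touching the processing logic, which is the same separation the Processor API relies on.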

Matthias J. Sax answered Sep 29 '22 08:09