Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to deploy Kafka Stream applications on Kubernetes?

My application has some aggregation/window operation, so it has some state store which stores in the state.dir. AFAIK, it also writes the changelog of state store to the broker, so is that OK to consider the Kafka Stream application as a stateless POD?

like image 336
a.l. Avatar asked Mar 26 '18 01:03

a.l.


People also ask

Can you run Kafka in Kubernetes?

Apache Kafka is frequently deployed on the Kubernetes container management system, which is used to automate deployment, scaling, and operation of containers across clusters of hosts. Apache Kafka on Kubernetes goes hand-in-hand with cloud-native development, the next generation of application development.

How does Kafka work with Kubernetes?

Kafka runs as a cluster of brokers, and these brokers can be deployed across a Kubernetes system and made to land on different workers across separate fault domains. Kubernetes automatically recovers pods when nodes or containers fail, so it can do this for your brokers too.


1 Answers

My application has some aggregation/window operation, so it has some state store which stores in the state.dir. AFAIK, it also writes the changelog of state store to the broker, so is that OK to consider the Kafka Stream application as a stateless POD?

Stateless pod and data safety (= no data loss): Yes, you can consider the application as a stateless pod as far as data safety is concerned; i.e. regardless of what happens to the pod Kafka and Kafka Streams guarantee that you will not lose data (and if you have enabled exactly-once processing, they will also guarantee the latter).

That's because, as you already said, state changes in your application are always continuously backed up to Kafka (brokers) via changelogs of the respective state stores -- unless you explicitly disabled this changelog functionality (it is enabled by default).

Note: The above is even true when you are not using Kafka's Streams default storage engine (RocksDB) but the alternative in-memory storage engine. Many people don't realize this because they read "in-memory" and (falsely) conclude "data will be lost when a machine crashes, restarts, etc.".

Stateless pod and application restoration/recovery time: The above being said, you should understand how having vs. not-having local state available after pod restarts will impact restoration/recovery time of your application (or rather: application instance) until it is fully operational again.

Imagine that one instance of your stateful application runs on a machine. It will store its local state under state.dir, and it will also continuously backup any changes to its local state to the remote Kafka cluster (brokers).

  • If the app instance is being restarted and does not have access to its previous state.dir (probably because it is restarted on a different machine), it will fully reconstruct its state by restoring from the associated changelog(s) in Kafka. Depending on the size of your state this may take milliseconds, seconds, minutes, or more. Only once its state is fully restored it will begin processing new data.
  • If the app instance is being restarted and does have access to its previous state.dir (probably because it is restarted on the same, original machine), it can recover much more quickly because it can re-use all or most of the existing local state, so only a small delta needs to restored from the associated changelog(s). Only once its state is fully restored it will begin processing new data.

In other words, if your application is able to re-use existing local state then this is good because it will minimize application recovery time.

Standby replicas to the rescue in stateless environments: But even if you are running stateless pods you have options to minimize application recovery times by configuring your application to use standby replicas via the num.standby.replicas setting:

num.standby.replicas

The number of standby replicas. Standby replicas are shadow copies of local state stores. Kafka Streams attempts to create the specified number of replicas and keep them up to date as long as there are enough instances running. Standby replicas are used to minimize the latency of task failover. A task that was previously running on a failed instance is preferred to restart on an instance that has standby replicas so that the local state store restoration process from its changelog can be minimized.

See also the documentation section State restoration during workload rebalance

Update 2018-08-29: Arguably the most convenient option to run Kafka/Kafka Streams/KSQL on Kubernetes is to use Confluent Operator or the Helm Charts provided by Confluent, see https://www.confluent.io/confluent-operator/. (Disclaimer: I work for Confluent.)

Update 2019-01-10: There's also a Youtube video that demoes how to Scale Kafka Streams with Kubernetes.

like image 158
Michael G. Noll Avatar answered Oct 15 '22 06:10

Michael G. Noll