
Work distribution with Kafka Streams

I'm using Kafka Streams to do concurrent work on a Kafka topic.

The stream is of the following form

stream(topic)
 .map(somefunction)
 .through(secondtopic)

I've configured Kafka Streams to use 15 worker threads, but the work does not appear to be balanced across them, if it is distributed at all. Is something wrong with my setup? I expected the work to be spread evenly among the worker threads.

[Image: snapshot from jvisualvm]

dmead asked Jul 04 '16 09:07



1 Answer

The effective parallelism is capped by the number of partitions of the input Kafka topic: you can configure more threads than that, but the extra threads stay idle.

All messages within one partition are handled by a single thread, which preserves the order of the messages in that partition.
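To make the cap concrete, here is a minimal sketch in plain Java (no Kafka dependency). It is a simplified model, not the actual Kafka Streams assignor: the class name `PartitionAssignmentSketch`, the round-robin strategy, and the figure of 4 partitions are all assumptions chosen for illustration. The point is only that each partition has exactly one owning thread, so threads without a partition have nothing to do.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentSketch {

    // Hypothetical helper: spreads partitions 0..partitions-1 over the
    // threads round-robin, so each partition has exactly one owning thread.
    public static List<List<Integer>> assign(int partitions, int threads) {
        List<List<Integer>> perThread = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            perThread.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            perThread.get(p % threads).add(p);
        }
        return perThread;
    }

    // Counts the threads that own at least one partition.
    public static long busyThreads(List<List<Integer>> perThread) {
        return perThread.stream().filter(ps -> !ps.isEmpty()).count();
    }

    public static void main(String[] args) {
        // Assumed numbers: a 4-partition input topic and 15 stream threads.
        List<List<Integer>> assignment = assign(4, 15);
        System.out.println("busy threads: " + busyThreads(assignment)
                + " of " + assignment.size());
    }
}
```

With these assumed numbers the program prints `busy threads: 4 of 15`. If your jvisualvm snapshot shows only a handful of active threads, checking the partition count of the input topic is a reasonable first step.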

In Kafka Streams, it is the input topic partitions, not the individual messages, that are distributed evenly across tasks.

So the work is well balanced between threads only if the messages are well balanced between partitions.
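The effect of unbalanced partitions can be sketched the same way, again in plain Java. Everything here is hypothetical: the class `KeySkewSketch`, the sample keys, and the use of `String.hashCode()` to pick a partition (Kafka's default partitioner uses a different hash, so real partition numbers will differ). It only illustrates that when one key dominates, one partition, and therefore one thread, receives most of the messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeySkewSketch {

    // Hypothetical sample: 90 messages share one hot key, 10 have distinct keys.
    public static List<String> sampleKeys() {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 90; i++) {
            keys.add("hot");
        }
        for (int i = 0; i < 10; i++) {
            keys.add("k" + i);
        }
        return keys;
    }

    // Assumption: partition = hash(key) mod partitions. This stands in for a
    // key-based partitioner; it is not Kafka's actual hash function.
    public static int[] countPerPartition(List<String> keys, int partitions) {
        int[] counts = new int[partitions];
        for (String key : keys) {
            counts[Math.floorMod(key.hashCode(), partitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countPerPartition(sampleKeys(), 4);
        // One entry holds at least 90 of the 100 messages, because every
        // "hot" message maps to the same partition.
        System.out.println("messages per partition: " + Arrays.toString(counts));
    }
}
```

In a situation like this, adding threads does not help: the thread that owns the hot partition still does nearly all the work. Spreading the keys more evenly, or repartitioning on a different key, is what changes the balance.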

For more information about the threading model, have a look at the Confluent documentation.

fhussonnois answered Oct 22 '22 04:10