What exactly happens when a repartition occurs in a Kafka stream?

Say I have a stream of employees, keyed by empId, where each record also includes a departmentId. I want to aggregate by department. So I do a selectKey() (with a mapper that returns departmentId), then groupByKey() (or I could just do a groupBy(...), I assume), and then, say, count(). What exactly happens? I gather that it does a "repartition". I think what happens is that it writes to an "internal" topic, which is just a regular topic with a derived name, created automatically. That is, a topic shared by all instances of the stream application, not just one (i.e. not local). So the aggregation is across all records with the new key, not just the messages seen by one source stream instance (I think). Is that correct?
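The mechanics described above can be sketched without Kafka at all. The following is a minimal simulation (plain Java, no Kafka dependency; `Employee`, `partitionFor`, and `countByDept` are made-up names, and `String.hashCode()` stands in for Kafka's real murmur2-based default partitioner): records are rewritten to a repartition "topic" partitioned by the new key, so every record for one department lands in the same partition regardless of which source partition it came from, and the downstream count is therefore global per key.

```java
import java.util.*;

public class RepartitionSketch {
    // Hypothetical employee record: empId is the old key, deptId the new one.
    public record Employee(String empId, String deptId) {}

    // Kafka's default partitioner is hash(serialized key) mod #partitions
    // (murmur2 in reality; String.hashCode() stands in for this sketch).
    public static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode()) % numPartitions;
    }

    // selectKey(deptId) + groupByKey() + count(), modeled as: write every
    // record to a repartition "topic" partitioned by the NEW key, then
    // count per key over that topic.
    public static Map<String, Long> countByDept(List<Employee> employees,
                                                int numPartitions) {
        Map<Integer, List<Employee>> repartitionTopic = new HashMap<>();
        for (Employee e : employees) {
            int p = partitionFor(e.deptId(), numPartitions);
            repartitionTopic.computeIfAbsent(p, k -> new ArrayList<>()).add(e);
        }
        // All records with the same deptId now sit in one partition, so the
        // count per department is global, not per source-stream instance.
        Map<String, Long> counts = new HashMap<>();
        for (List<Employee> partition : repartitionTopic.values())
            for (Employee e : partition)
                counts.merge(e.deptId(), 1L, Long::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(
            new Employee("e1", "sales"), new Employee("e2", "eng"),
            new Employee("e3", "sales"), new Employee("e4", "eng"),
            new Employee("e5", "sales"));
        // Prints the per-department counts (map iteration order may vary).
        System.out.println(countByDept(employees, 4));
    }
}
```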

I've not found a comprehensive description of repartitioning. Can anybody point me to a good article on this?

mconner asked Mar 07 '19 20:03


People also ask

How does repartition work in Kafka?

Partitioning takes the single topic log and breaks it into multiple logs, each of which can live on a separate node in the Kafka cluster. This way, the work of storing messages, writing new messages, and processing existing messages can be split among many nodes in the cluster.

What does the sink do in a Kafka stream topology?

Finally, a sink is a node in the graph that receives records from upstream nodes and writes them to a Kafka topic. A Topology lets you construct an acyclic graph of these nodes; the topology is then passed into a new KafkaStreams instance, which begins consuming, processing, and producing records.

How does Kafka aggregation work?

In the Kafka Streams DSL, an input stream of an aggregation operation can be a KStream or a KTable, but the output stream will always be a KTable. This allows Kafka Streams to update an aggregate value upon the out-of-order arrival of further records after the value was produced and emitted.
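That "output is always a KTable" behavior can be sketched in a few lines of plain Java (no Kafka dependency; `applyCounts` is a made-up name): the aggregate is a key-to-value table, and each arriving record, including a late or out-of-order one, simply updates the current value for its key and emits the new aggregate.

```java
import java.util.*;

public class AggregateSketch {
    // The output of count() behaves like a KTable: an ever-updating
    // key -> value view. Each arriving record (even a late one) updates
    // the current aggregate for its key and emits the new value.
    public static List<String> applyCounts(List<String> deptIds) {
        Map<String, Long> table = new HashMap<>();   // KTable-like state store
        List<String> changelog = new ArrayList<>();  // emitted updates
        for (String dept : deptIds) {
            long updated = table.merge(dept, 1L, Long::sum);
            changelog.add(dept + "=" + updated);
        }
        return changelog;
    }

    public static void main(String[] args) {
        // The second "sales" record just bumps the existing aggregate.
        System.out.println(applyCounts(List.of("sales", "eng", "sales")));
        // [sales=1, eng=1, sales=2]
    }
}
```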

Does Kafka Streams support both stateful and stateless operations?

Yes. Kafka Streams supports stateless operations such as map() and filter() as well as stateful operations such as aggregations, joins, and windowing, which are backed by local state stores.


1 Answer

What you describe is exactly what is happening.

A repartition step is the same as a through() (auto-inserted into the processing topology), which is a shortcut for to("topic") plus builder.stream("topic").
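On the "derived name" the question mentions: Kafka Streams names internal repartition topics by prefixing the application.id and appending a "-repartition" suffix. A tiny sketch of that convention (the operator-name segment shown is illustrative; the actual generated segment depends on the topology and any Grouped/Named names you supply):

```java
public class InternalTopicName {
    // Internal repartition topics follow the pattern
    // <application.id>-<generated or user-supplied name>-repartition.
    public static String repartitionTopic(String applicationId,
                                          String operatorName) {
        return applicationId + "-" + operatorName + "-repartition";
    }

    public static void main(String[] args) {
        // Illustrative only: the generated segment varies per topology.
        System.out.println(repartitionTopic(
            "dept-counter", "KSTREAM-AGGREGATE-STATE-STORE-0000000002"));
        // dept-counter-KSTREAM-AGGREGATE-STATE-STORE-0000000002-repartition
    }
}
```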

It's also illustrated and explained in this blog post: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/

Matthias J. Sax answered Sep 23 '22 11:09