
Kafka Streams - reusing streams using through() vs toStream() + to()

I want to know the difference between reusing a stream via .through() and reusing the stream reference returned by .toStream() after calling .to() on it.

Using .through()

    KStream<String, String> subStream = mainStream
        .groupByKey(..)
        .aggregate(..)
        .toStream(..)
        .through("aggregate-topic", ..);
    // Then use the (new) stream returned by through() to create another topic

vs. using .toStream() + .to()

    KStream<String, String> subStream = mainStream
        .groupByKey(..)
        .aggregate(..)
        .toStream(..);
    subStream.to("aggregate-topic", ..);
    // Reuse the existing subStream from toStream() to create another topic

I have implemented a feature that uses the latter, because that's what made sense before I learned about the through() method.

What I'm curious about now is what happens internally for both options: are there any benefits or disadvantages to choosing one option over the other?

Rich P asked Dec 08 '22 13:12



1 Answer

Yes, there is a difference, and the two options come with different tradeoffs:

  1. The first version, using through(), will create a "linear plan" and will split the topology into two sub-topologies. Note that through("topic") is exactly the same thing as to("topic") plus builder.stream("topic").

    mainStream -> grp -> agg -> toStream -> to -> TOPIC -> builder.stream -> subStream

The first sub-topology runs from mainStream to to(); the "aggregate-topic" separates it from the second sub-topology, which consists of builder.stream() and feeds into subStream. This implies that all data is written into "aggregate-topic" first and read back afterwards. This increases end-to-end processing latency and increases broker load because of the additional read operation. The advantage is that both sub-topologies can be parallelized independently: the parallelism of each one is determined by the number of partitions of its input topic. This creates more tasks and thus allows for more parallelism, as both sub-topologies can be executed on different threads.

  2. The second version will create a "branched plan" and will be executed as one sub-topology:

    mainStream -> grp -> agg -> toStream -+-> to -> TOPIC
                                          |
                                          +-> subStream

After the toStream(), the data is logically broadcast to both downstream operators. This implies that there is no round-trip through the "aggregate-topic"; instead, records are forwarded in-memory to subStream. This reduces the end-to-end latency and does not require reading data back from the Kafka cluster. However, you have fewer tasks and thus reduced maximum parallelism.
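
To make the task-count difference concrete, here is a rough sketch in plain Java (it does not use the Kafka Streams API). It assumes the usual rule that each sub-topology gets as many tasks as the largest partition count among its input topics, and that groupByKey() does not insert a repartition topic (i.e. the key was not changed upstream). The class name, method names, and the partition counts (4 for the topic behind mainStream, 6 for "aggregate-topic") are made up for illustration.

```java
import java.util.List;

public class TaskCountSketch {

    // Tasks for one sub-topology: the maximum partition count
    // among its input topics (assumed rule, see the text above).
    static int tasksForSubTopology(List<Integer> inputTopicPartitions) {
        int max = 0;
        for (int partitions : inputTopicPartitions) {
            max = Math.max(max, partitions);
        }
        return max;
    }

    // Total tasks of a topology: the sum over all of its sub-topologies.
    static int totalTasks(List<List<Integer>> subTopologies) {
        int total = 0;
        for (List<Integer> inputs : subTopologies) {
            total += tasksForSubTopology(inputs);
        }
        return total;
    }

    public static void main(String[] args) {
        int mainTopicPartitions = 4;      // hypothetical partition count
        int aggregateTopicPartitions = 6; // hypothetical partition count

        // Linear plan (through()): two sub-topologies, one per input topic.
        int linear = totalTasks(List.of(
                List.of(mainTopicPartitions),
                List.of(aggregateTopicPartitions)));

        // Branched plan (toStream() + to()): a single sub-topology.
        int branched = totalTasks(List.of(List.of(mainTopicPartitions)));

        System.out.println("through():         " + linear + " tasks");
        System.out.println("toStream() + to(): " + branched + " tasks");
    }
}
```

With these made-up numbers the linear plan yields 10 tasks and the branched plan 4, which is the "more tasks, more parallelism" versus "no round-trip, lower latency" tradeoff described above.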

Matthias J. Sax answered Apr 02 '23 14:04
