Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Kafka Stream topology optimization

While readying about topology optomization, i stumble upon the following:

Currently, there are two optimizations that Kafka Streams performs when enabled:

1 - The source KTable re-uses the source topic as the changelog topic.

2 - When possible, Kafka Streams collapses multiple repartition topics into a single repartition topic.

This question is for the first point. I do not fully understand what is happening under the hood here. Just to make sure that i am not making any assumption here. Can someone explain, what was the state before:

1 - Do the KTable, use an internal changelog topic ? if yes, can someone point me to a doc about that ? Next, what is in that changelog topic ? Is it the actually upsert log, comsposed of update operation ?

2 - If my last guess is true, i do not understand how that changelog composed of upsert can be replace by the source topic only ?

like image 790
MaatDeamon Avatar asked Dec 31 '22 19:12

MaatDeamon


1 Answers

A changelog topic is a Kafka topic configured with log compaction. Each update to the KTable is written into the changelog topic. Because the topic is compacted, no data is ever lost and re-reading the changelog topic allows to re-create the local store.

The assumption of this optimization is, that the source topic is a compacted topic. For this case, the source topic and the corresponding changelog topic would contain the exact same data. Thus, the optimization removes the changelog topic and uses the source topic to re-create the state store during recovery.

If your input topic is not compacted but applies a retention time, you might not want to enable the optimization as this could result in data loss.

About the history: Initially, Kafka Streams had this optimization hardcoded (and thus "forced" users to only read compacted topics as KTables if potential data loss is not acceptable). However, in version 1.0 a regression bug was introduced (via https://issues.apache.org/jira/browse/KAFKA-3856: the new StreamsBuilder behavior was different to old KStreamBuilder and StreamsBuilder would always create a changelog topic) "removing" the optimization. In version 2.0, the issue was fixed and the optimization is available again. (cf https://issues.apache.org/jira/browse/KAFKA-6874)

Note: the optimization is only available for source KTables. For KTables that are the result of an computation, like an aggregation or other, the optimization is not available and a changelog topic will be created (if not explicitly disabled what disables fault-tolerance for the corresponding store).

like image 166
Matthias J. Sax Avatar answered May 17 '23 10:05

Matthias J. Sax