Kafka streams to build materialised views

I'm trying to build a materialized view from a stream of database updates (captured from the DBMS's transaction log, e.g. with the help of maxwell-daemon). The view is materialized as a Kafka compacted topic.

The view is a simple join and could be expressed as a query like this:

SELECT u.email user_email, t.title todo_title, t.state todo_state
FROM   User u
JOIN   Todo t
ON     t.user_id = u.id

I want the view to be updated every time User or Todo changes (i.e., a message published to the view's Kafka topic).

With Kafka Streams it seems possible to achieve that by doing this (see the sketch after the list):

  • Make a KTable of User changes
  • Make a KTable of Todo changes
  • Join both
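
Here is a minimal sketch of that topology. It assumes Kafka Streams 2.4+, whose foreign-key join (KIP-213) can join Todo to User on user_id; on older releases you would first have to re-key the Todo table by user_id. The topic names (db.user, db.todo, todo-view), the assumption that the change topics are keyed by primary key, and the regex-based field extraction (standing in for proper deserialization of Maxwell's JSON payload) are all illustrative:

import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TodoViewApp {

    // Crude JSON field extraction; a real application would deserialize
    // the Maxwell payload with a proper JSON library instead.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\"\\s*:\\s*\"?([^\",}]+)").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "todo-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Latest row per primary key, materialized from the change topics
        // (assumes the topics are keyed by primary key).
        KTable<String, String> users =
                builder.table("db.user", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> todos =
                builder.table("db.todo", Consumed.with(Serdes.String(), Serdes.String()));

        // Foreign-key join (KIP-213): for each todo, look up the user
        // referenced by its user_id column and build the view row.
        KTable<String, String> view = todos.join(
                users,
                todo -> field(todo, "user_id"),
                (todo, user) -> "{\"user_email\":\"" + field(user, "email")
                        + "\",\"todo_title\":\"" + field(todo, "title")
                        + "\",\"todo_state\":\"" + field(todo, "state") + "\"}");

        // Publish the view to a compacted output topic, keyed by todo id,
        // so the topic always holds the latest view row per todo.
        view.toStream().to("todo-view", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}

Because the result is keyed by the todo's primary key, compaction on the output topic keeps exactly one current row of the view per todo.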

However, I'm not sure of a few things:

  • Is that even possible?
  • Will this maintain the original ordering of events? E.g., if User is changed and then Todo is changed, am I guaranteed to see these changes in that order in the result of the join?
  • How to handle transactions? E.g., multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions?
asked Dec 29 '25 by Arnaud Le Blanc

1 Answer

  • Is that even possible?

Yes. The pattern you describe will compute what you want out of the box.

  • Will this maintain the original ordering of events? E.g., if User is changed and then Todo is changed, am I guaranteed to see these changes in that order in the result of the join?

Streams processes data according to timestamps (i.e., records with smaller timestamps first), so in general this will work as expected. However, there is no strict guarantee, because in stream processing it's more important to keep making progress (and not block). Streams therefore applies only a best-effort approach to processing records in timestamp order: for example, if one changelog does not provide any data, Streams keeps going and processes data from the other changelog only (rather than blocking). This can lead to out-of-order processing with regard to timestamps across different partitions/topics.
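
One way to tighten (though still not guarantee) this ordering for the CDC use case is to take record timestamps from the change events themselves rather than from broker append time, using a custom TimestampExtractor registered under StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG. A minimal sketch, assuming the payload carries an epoch-seconds "ts" field as Maxwell's JSON does (the field name and regex parsing are illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Derives each record's timestamp from the change event payload, so that
// Streams' (best-effort) timestamp ordering follows the database's own
// event time rather than broker append time.
public class ChangeEventTimestampExtractor implements TimestampExtractor {

    private static final Pattern TS = Pattern.compile("\"ts\"\\s*:\\s*(\\d+)");

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Matcher m = TS.matcher(String.valueOf(record.value()));
        // Maxwell's "ts" is in epoch seconds; Streams expects milliseconds.
        // Fall back to the partition time if the field is missing.
        return m.find() ? Long.parseLong(m.group(1)) * 1000L : partitionTime;
    }
}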

  • How to handle transactions? E.g., multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions?

That's not possible at the moment. Each update is processed individually, and you will see every intermediate (i.e., not-yet-committed) result. However, Kafka will introduce "transactional processing" in the future, which will enable handling transactions. (See https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging and https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics)
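
For reference, those KIPs later shipped in Kafka 0.11.0, where exactly-once processing in Streams is a single config switch. A minimal sketch; note that this makes Streams' own read-process-write cycle atomic, but it does not by itself group several upstream database changes into one atomic view update:

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class EosConfig {
    public static Properties eosProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "todo-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Enable exactly-once processing (KIP-98 / KIP-129). Kafka 3.0+
        // prefers the newer StreamsConfig.EXACTLY_ONCE_V2 constant.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}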

answered Jan 01 '26 by Matthias J. Sax


