Confluent connect-jdbc and exactly-once delivery

Is kafka-connect-jdbc safe against lost and duplicate rows when an auto-incrementing primary key column in the database is used as the incrementing field?

asked Apr 23 '18 by glarus089


1 Answer

It is absolutely not safe in auto-incrementing mode. The issue is transaction isolation and the resulting visibility characteristics: the order in which transactions are started (and the values any auto-incrementing fields acquire at that point) is not the same as the order in which those transactions commit. This is particularly pronounced in mixed workloads, where transactions take varying amounts of time to complete. So, as an observer, what you'll see in the table are temporary 'gaps' in the visible records until the in-flight transactions commit. If transaction T0 with key 0 starts before T1 with key 1, but T1 commits first, the Kafka Connect source connector will observe the effects of T1, publish the record and advance the watermark to key 1. Later, T0 will eventually commit, but by then the connector has moved past key 0 and the row is silently skipped.
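
For concreteness, here is a minimal sketch of the incrementing-mode configuration being discussed. The connection URL, table, column and topic names are placeholder assumptions, not values from the question:

    # Hypothetical JDBC source connector using incrementing mode (the unsafe case above).
    # All connection details, table and topic names are placeholders.
    name=orders-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:mysql://db-host:3306/shop
    connection.user=connect
    connection.password=secret
    mode=incrementing
    # The watermark is tracked against this auto-incrementing primary key,
    # which is what makes out-of-order commits invisible to the connector.
    incrementing.column.name=id
    table.whitelist=orders
    topic.prefix=jdbc-
    poll.interval.ms=5000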

This is a reported issue, and the Kafka Connect documentation is not transparent about the known limitations (in spite of the issue being open with the KC JDBC team since 2016).

One workaround is to use the timestamp mode (which isn't safe on its own), and add a lag via the timestamp.delay.interval.ms property. As per the Confluent documentation:

How long to wait after a row with certain timestamp appears before we include it in the result. You may choose to add some delay to allow transactions with earlier timestamp to complete. The first execution will fetch all available records (that is, starting at timestamp 0) until current time minus the delay. Every following execution will get data from the last time we fetched until current time minus the delay.
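
A sketch of what that workaround might look like as connector configuration; the column, table and topic names, and the five-second delay, are illustrative assumptions:

    # Hypothetical timestamp-mode configuration with a commit-lag grace period.
    name=orders-jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:mysql://db-host:3306/shop
    connection.user=connect
    connection.password=secret
    mode=timestamp
    timestamp.column.name=updated_at
    # Only rows whose timestamp is older than (now - 5s) are fetched, giving
    # in-flight transactions up to 5 seconds to commit before the watermark
    # passes them. Tune this to the longest transaction you expect.
    timestamp.delay.interval.ms=5000
    table.whitelist=orders
    topic.prefix=jdbc-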

This solves one problem (awkwardly), but introduces another. Now the source connector will lag behind the 'tail' (so to speak) of the table by the duration of the timestamp delay, on the off-chance that a latent transaction commits within that grace period. The longer the grace period, the longer the lag. So this might not be an option in applications that require near-real-time messaging.

You could try to relax the isolation level of the source connector's queries, but that has other implications, especially if your application relies on the transactional outbox pattern for guaranteed message delivery.

One safe solution with Kafka Connect is to employ CDC (change data capture) or an equivalent, and point the source connector at the CDC table (which will be in commit order). You can either use your database's raw CDC facility, or Debezium as a 'portable' variant. This adds to the database I/O, but gives the connector a linear history to work off.
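
For the Debezium route, a sketch of a MySQL source connector that reads the transaction log (and therefore observes changes in commit order) rather than polling the table. Host, credentials and table names are placeholders, and some property names differ between Debezium versions (this follows the 1.x naming; 2.x, for example, replaces database.server.name with topic.prefix):

    # Hypothetical Debezium MySQL source reading the binlog in commit order.
    name=shop-debezium-source
    connector.class=io.debezium.connector.mysql.MySqlConnector
    database.hostname=db-host
    database.port=3306
    database.user=debezium
    database.password=secret
    database.server.id=184054
    database.server.name=shop
    table.include.list=shop.orders
    # Debezium keeps the captured schema history in a Kafka topic (1.x naming).
    database.history.kafka.bootstrap.servers=kafka:9092
    database.history.kafka.topic=schema-changes.shop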

answered Sep 28 '22 by Emil Koutanov