The usual way of implementing the outbox pattern is to store the message payload in an outbox table and have a separate process (the Message Relay) query for pending messages and publish them into a message broker, Kafka in my case.
The state of the outbox table could be as shown below.
OUTBOX TABLE
--------------------------------------
| ID | STATE     | TOPIC   | PAYLOAD |
--------------------------------------
| 1  | PROCESSED | user    |         |
| 2  | PENDING   | user    |         |
| 3  | PENDING   | billing |         |
--------------------------------------
My Message Relay is a Spring Boot/Cloud Stream application that periodically (@Scheduled) looks for PENDING records, publishes them to Kafka, and updates each record to the PROCESSED state.
The first problem is: if I start multiple instances of the Message Relay, all of them would query the outbox table, and at some point different instances could pick up the same PENDING records and publish them to Kafka, generating duplicate messages. How can I prevent this?
Another situation: suppose there is only one Message Relay. It gets one PENDING record and publishes it to the topic, but crashes before updating the record to PROCESSED. When it starts up again, it would find the same PENDING record and publish it again. Is there a way to avoid this duplication, or is the only way to design an idempotent system?
To prevent the first problem, use database row locking: read the records inside a transaction with a locking read, for example:
SELECT * FROM outbox WHERE id = 1 FOR UPDATE
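The lock is held until the transaction commits or rolls back. In the meantime, any other process that tries to lock or update the same row has to wait, so two relay instances cannot work on the same record at once, provided every instance reads with FOR UPDATE.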
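A minimal JDBC sketch of how a relay instance might claim a batch of rows this way. It assumes PostgreSQL 9.5+ or MySQL 8.0+, lower-case column names matching the table in the question, and uses SKIP LOCKED, which is a refinement of the plain FOR UPDATE above: instead of waiting on rows another instance has locked, the query skips them. The class OutboxClaimer and its method names are hypothetical, not part of Spring or any library.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only. */
final class OutboxClaimer {

    // Locks up to N pending rows; rows already locked by another
    // relay instance are skipped rather than waited on.
    static final String CLAIM_SQL =
        "SELECT id FROM outbox WHERE state = 'PENDING' "
      + "ORDER BY id LIMIT ? FOR UPDATE SKIP LOCKED";

    static final String MARK_SQL =
        "UPDATE outbox SET state = 'PROCESSED' WHERE id = ?";

    /**
     * Returns the IDs of the rows this transaction now holds locks on.
     * The caller must have auto-commit disabled, otherwise the locks
     * are released as soon as the statement finishes.
     */
    static List<Long> claimBatch(Connection conn, int batchSize) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(CLAIM_SQL)) {
            ps.setInt(1, batchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
            }
        }
        return ids;
    }

    /** Marks one claimed row as PROCESSED inside the same transaction. */
    static void markProcessed(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(MARK_SQL)) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }
}
```

The intended order would be: conn.setAutoCommit(false), claimBatch, publish each claimed record, markProcessed for each, then conn.commit(). The row locks are released at commit, so the publish step has to happen before it.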
The second problem cannot be solved completely, because there is no distributed transaction spanning your database and Kafka.
One option is to set the record to an intermediate state such as PROCESSING before sending it to Kafka. If the application crashes, it should look on restart for records still in the PROCESSING state and run a clean-up task to find out whether they were already sent to Kafka.
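A rough JDBC sketch of the PROCESSING state idea, assuming the same outbox table with lower-case column names. The class OutboxStateTransitions and its methods are hypothetical. The conditional UPDATE acts as a compare-and-set: it only succeeds if the row is still PENDING. How to decide whether a stuck record actually reached Kafka is deliberately left out, since that depends on the system.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only. */
final class OutboxStateTransitions {

    // Moves a row to PROCESSING only if it is still PENDING.
    static final String TO_PROCESSING_SQL =
        "UPDATE outbox SET state = 'PROCESSING' WHERE id = ? AND state = 'PENDING'";

    // Rows left in PROCESSING after a crash are candidates for clean-up.
    static final String STUCK_SQL =
        "SELECT id FROM outbox WHERE state = 'PROCESSING' ORDER BY id";

    /** Returns true if this caller moved the row from PENDING to PROCESSING. */
    static boolean tryMarkProcessing(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(TO_PROCESSING_SQL)) {
            ps.setLong(1, id);
            // Exactly one updated row means the state was PENDING until now.
            return ps.executeUpdate() == 1;
        }
    }

    /** Lists rows whose publish may or may not have reached Kafka. */
    static List<Long> findStuck(Connection conn) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(STUCK_SQL);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                ids.add(rs.getLong("id"));
            }
        }
        return ids;
    }
}
```

On startup the relay would call findStuck and decide per record whether to publish again or mark it PROCESSED. If that cannot be determined reliably, publishing again and relying on idempotent consumers is the safer default.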
But the best solution would be to have an idempotent system that can handle duplicates.
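As an illustration of what idempotent could mean on the consumer side, here is a minimal sketch that drops messages whose ID has already been handled. It assumes the relay sends the outbox ID along with each message (for example as the Kafka record key or a header) so the consumer can see it. The class IdempotentHandler is hypothetical, and the in-memory set is for illustration only: a real consumer would need to persist handled IDs, ideally in the same transaction as the side effect.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical consumer-side wrapper that ignores duplicate deliveries. */
final class IdempotentHandler {

    // Assumption: in-memory only; lost on restart, so not production-ready.
    private final Set<Long> handledIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> delegate;

    IdempotentHandler(Consumer<String> delegate) {
        this.delegate = delegate;
    }

    /**
     * Runs the delegate once per message ID.
     * Returns true if the payload was handled, false if it was a duplicate.
     */
    boolean handle(long messageId, String payload) {
        // Set.add returns false when the ID is already present.
        if (!handledIds.add(messageId)) {
            return false;
        }
        try {
            delegate.accept(payload);
        } catch (RuntimeException e) {
            // Forget the ID so a redelivery can retry the failed message.
            handledIds.remove(messageId);
            throw e;
        }
        return true;
    }
}
```

With this in place, a record that the relay publishes twice after a crash is processed only once by the consumer, as long as both copies carry the same ID.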