For a microservice in a new project I am currently considering whether to use DynamoDB or Aurora MySQL as the underlying data store. The microservice offers a REST API to a user interface, and there will be several other microservices. Those other microservices are supposed to listen to an event stream (event sourcing) generated by the UI-connected service to keep additional read models in sync.
I am trying to figure out a way to guarantee that the events published to the change event stream exactly match the changes to the data in the underlying data store. The concern is that if the REST API handler is, for example, interrupted halfway through its execution, it may have changed the data but not yet published the corresponding event (assuming that the change event is published after the data change). I am now looking for mechanisms that alleviate this concern.
For DynamoDB there are DynamoDB Streams and AWS Lambda triggers to react to data changes at the data store level. The triggered Lambda could transform the low-level data change into a meaningful change event and then publish the event to SNS, SQS or Kinesis.
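As a rough illustration, a stream-triggered Lambda could look like the TypeScript sketch below. The topic ARN environment variable and the domain event names are invented for the example; this is a sketch of the idea, not a production implementation:

```typescript
import { DynamoDBStreamHandler } from "aws-lambda";
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";
import { unmarshall } from "@aws-sdk/util-dynamodb";
import { AttributeValue } from "@aws-sdk/client-dynamodb";

const sns = new SNSClient({});
const TOPIC_ARN = process.env.CHANGE_EVENT_TOPIC_ARN!; // assumed env var

export const handler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    if (!record.dynamodb?.NewImage) continue; // e.g. REMOVE has no new image
    // The stream delivers items in the low-level AttributeValue format;
    // unmarshall turns them into plain objects. (The aws-lambda and SDK
    // type definitions differ slightly, hence the cast.)
    const item = unmarshall(
      record.dynamodb.NewImage as unknown as Record<string, AttributeValue>
    );
    // Transform the raw change into a meaningful domain event (the event
    // type names here are made up for the example).
    const domainEvent = {
      type: record.eventName === "INSERT" ? "ItemCreated" : "ItemChanged",
      payload: item,
    };
    await sns.send(
      new PublishCommand({
        TopicArn: TOPIC_ARN,
        Message: JSON.stringify(domainEvent),
      })
    );
  }
};
```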
For Aurora MySQL I have yet to come up with such a mechanism. I have seen articles that describe two mechanisms: 1) an additional process that tails the binary log and derives events from the raw SQL changes, and 2) invoking a Lambda function directly from the database via the native lambda_sync/lambda_async functions.
However, I am not too happy with either approach: 1) I would prefer not to manage additional EC2 instances and process raw SQL changes. 2) I am planning to use constraints, optimistic concurrency and transactions with Aurora, which means that transactions can and will fail and roll back. However, the lambda_(a)sync calls will have been executed regardless of the transaction outcome.
Any better ideas for Aurora? Or am I looking at this problem from the wrong angle?
I would like to keep this question and discussion focused on how to guarantee consistency between changes in the underlying data store and an outgoing stream of change events, not on Aurora vs. DynamoDB.
I found an answer that will work for our situation, using Aurora with MySQL compatibility. During my research I came across the excellent source of information at microservices.io. Specifically, the page about the event-driven architecture pattern refers to four related patterns that guarantee atomicity between updating state and publishing events.
Event sourcing is out of the question because it is way too complex for what we want to achieve. I already argued against transaction log tailing in my original question. Application events and DB triggers are very similar: as part of a transaction, the state is updated and an entry is written to an Events table. If the transaction commits successfully, the state is persisted and the Event entry shows up in that table; if the transaction rolls back, the state is unchanged and no Event entry appears. The only difference between the two is whether the Event entries are generated by the application/service logic itself or by DB triggers.
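To make that concrete, here is a minimal sketch of the application-events variant in TypeScript with mysql2. The table and column names (Orders, Events(id, type, payload)) and the order example are invented for illustration:

```typescript
import { Connection } from "mysql2/promise";

export async function updateOrderAndRecordEvent(
  conn: Connection,
  orderId: string,
  newStatus: string
): Promise<void> {
  await conn.beginTransaction();
  try {
    // 1) Change the state ...
    await conn.execute("UPDATE Orders SET status = ? WHERE id = ?", [
      newStatus,
      orderId,
    ]);
    // 2) ... and write the Event entry in the same transaction.
    await conn.execute("INSERT INTO Events (type, payload) VALUES (?, ?)", [
      "OrderStatusChanged",
      JSON.stringify({ orderId, newStatus }),
    ]);
    // Either both rows are persisted, or (on rollback) neither is.
    await conn.commit();
  } catch (err) {
    await conn.rollback();
    throw err;
  }
}
```

If the handler crashes anywhere before the commit, the rollback discards the Event entry together with the state change, which is exactly the atomicity the pattern is after.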
An external process then polls this table and publishes events for the other microservices based on the Event entries (and of course deletes the published ones afterwards). These two patterns guarantee that a state change always results in at least one event (exactly once would be a bit more complex to achieve).
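A single polling cycle could then look like the following sketch. The batch size, the ordering by an assumed auto-increment id column, and SNS as the channel are all assumptions; if consumers need strict ordering end-to-end, an ordered channel such as Kinesis or an SNS/SQS FIFO topic would be required instead of plain SNS:

```typescript
import { Connection, RowDataPacket } from "mysql2/promise";
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});

export async function publishPendingEvents(conn: Connection): Promise<void> {
  // Read a batch of unpublished events in insertion order.
  const [rows] = await conn.execute<RowDataPacket[]>(
    "SELECT id, type, payload FROM Events ORDER BY id LIMIT 100"
  );
  for (const row of rows) {
    // Publish first, delete afterwards: a crash between the two steps means
    // the event is published again on the next cycle (at-least-once).
    await sns.send(
      new PublishCommand({
        TopicArn: process.env.CHANGE_EVENT_TOPIC_ARN!, // assumed env var
        Message: JSON.stringify({ type: row.type, payload: row.payload }),
      })
    );
    await conn.execute("DELETE FROM Events WHERE id = ?", [row.id]);
  }
}
```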
Now about how to implement this... my first idea was to use a Fargate container with a Node app that does the polling, thinking that I would stay serverless with this solution. However, that turned out not to be quite true: In order to guarantee order of events, there should be just one container polling and publishing. A single Fargate container is assigned to one availability zone, and if that zone "goes away", so does the container. Now I would have to build some kind of monitoring on top so that a new container #2 gets instantiated in a different AZ #2 if and when needed. But what if AZ #1 and container #1 come back? Then there would be two instances. This is getting way too complex.
For now I settled on the following approach: a CloudWatch Events rule triggers a polling Lambda function once a minute (the minimum interval CloudWatch supports). Once called, the function keeps polling the DB until a second Lambda function call takes over one minute later. In order for the two Lambda function calls to coordinate, I created a second table in my DB, Event Polling State, in which the most recent Lambda function call updates a dedicated row to indicate to the previous call that it has started (this is done with the help of SELECT ... FOR UPDATE and transactions to prevent race conditions). Before each polling cycle, the function checks that row in Event Polling State to verify that no other function call has started in the meantime.
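A sketch of that coordination, assuming a one-row table EventPollingState(id, owner, started_at) and a random token identifying each Lambda invocation:

```typescript
import { Connection, RowDataPacket } from "mysql2/promise";
import { randomUUID } from "crypto";

// On startup: register this invocation as the current poller and return its
// token. SELECT ... FOR UPDATE locks the row so that two overlapping
// invocations cannot race on the claim.
export async function claimPolling(conn: Connection): Promise<string> {
  const me = randomUUID(); // fresh token per invocation
  await conn.beginTransaction();
  try {
    await conn.execute(
      "SELECT owner FROM EventPollingState WHERE id = 1 FOR UPDATE"
    );
    await conn.execute(
      "UPDATE EventPollingState SET owner = ?, started_at = NOW() WHERE id = 1",
      [me]
    );
    await conn.commit();
  } catch (err) {
    await conn.rollback();
    throw err;
  }
  return me;
}

// Before each polling cycle: continue only while no newer invocation has
// overwritten the owner token.
export async function stillOwner(
  conn: Connection,
  me: string
): Promise<boolean> {
  const [rows] = await conn.execute<RowDataPacket[]>(
    "SELECT owner FROM EventPollingState WHERE id = 1"
  );
  return rows.length > 0 && rows[0].owner === me;
}
```

The main loop is then essentially: claim ownership once, and keep calling publishPendingEvents (with a short sleep between cycles) only as long as stillOwner returns true.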
Advantages of this approach (as I see them): 1) it stays serverless, with no EC2 instances or containers to manage; 2) at any point in time there is effectively only one active poller, which preserves the order of events; 3) together with the Events table it guarantees at-least-once delivery, even when transactions roll back.