How to restore state in an event-based, message-driven microservice architecture in a failure scenario

In the context of a microservice architecture, a message-driven, asynchronous, event-based design seems to be gaining popularity (see here and here for some examples, as well as the Reactive Manifesto's Message Driven trait), as opposed to a synchronous (possibly REST-based) mechanism.

Taking that context and imagining an overly simplified ordering system, as depicted below:

[Figure: ordering system]

and the following message flow:

1. An order is placed from some source (web/mobile etc.)
2. The Order service accepts the order and publishes a CreateOrderEvent
3. The Inventory service reacts to the CreateOrderEvent, does some inventory stuff and publishes an InventoryUpdatedEvent when it's done
4. The Invoice service then reacts to the InventoryUpdatedEvent, sends an invoice and publishes an EmailInvoiceEvent

All services are up and we happily process orders... Everyone is happy. Then, the Inventory service goes down for some reason 😬

Assume that the events on the event bus flow in a "non-blocking" manner, i.e. the messages are published to a central topic and do not pile up on a queue if no service is reading from it. (What I'm trying to convey is an event bus where a published event flows "straight through" and is not queued up; ignore which messaging platform/technology is used at this point.) That would mean that if the Inventory service were down for 5 minutes, the CreateOrderEvents passing through the event bus during that time are now "gone", or at least never seen by the Inventory service, because in our overly simplified system no other service is interested in those events.

My question then is: How does the Inventory service (and the system as a whole) restore state in a way that no orders are missed/not processed?

asked Jun 10 '16 13:06

Donovan Muller

1 Answer

Good question! So there are basically three forces at play here:

1. If a service goes down, any of the events it may have missed need to be replayed to keep it consistent.
2. The events, as they happen in "time", have a "this happened before that" ordering to them.
3. There may be (but doesn't have to be) another party interested in overseeing a cloud of events to make sure a certain state is achieved.

For both #1 and #2 you want some sort of persistent log of events. A traditional message queue/topic may provide this, though you have to consider the cases where messages may be processed out of order with respect to transactions, exceptions, and fault behavior. A simpler log like Apache BookKeeper, Apache Kafka, AWS Kinesis, etc. can persist these events in sequence and leave it to the consumers to process them in order, filter out duplicates, partition streams, and so on.
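To make the replay idea concrete, here is a minimal in-memory sketch in Python. It is not tied to Kafka or any other product: `EventLog`, `InventoryConsumer`, the event ids, and the payload shape are all hypothetical names made up for illustration, and a real consumer would store its offset and seen ids durably rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only, in-memory stand-in for a persistent log (hypothetical)."""
    events: list = field(default_factory=list)

    def append(self, event_id: str, event_type: str, payload: dict) -> int:
        self.events.append({"id": event_id, "type": event_type, "payload": payload})
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int):
        # Yield (offset, event) pairs in the order they were appended.
        for i in range(offset, len(self.events)):
            yield i, self.events[i]


@dataclass
class InventoryConsumer:
    """Tracks its own position in the log and skips duplicates."""
    next_offset: int = 0
    seen_ids: set = field(default_factory=set)
    reserved_orders: list = field(default_factory=list)

    def poll(self, log: EventLog) -> None:
        for offset, event in log.read_from(self.next_offset):
            if event["type"] == "CreateOrderEvent" and event["id"] not in self.seen_ids:
                self.reserved_orders.append(event["payload"]["order_id"])
                self.seen_ids.add(event["id"])
            # Advance only after the event has been handled.
            self.next_offset = offset + 1


log = EventLog()
consumer = InventoryConsumer()

log.append("e1", "CreateOrderEvent", {"order_id": "A"})
consumer.poll(log)                      # service is up, handles order A

# Service is "down": events keep arriving, including a redelivered duplicate.
log.append("e2", "CreateOrderEvent", {"order_id": "B"})
log.append("e2", "CreateOrderEvent", {"order_id": "B"})
log.append("e3", "CreateOrderEvent", {"order_id": "C"})

consumer.poll(log)                      # service is back, resumes at offset 1
print(consumer.reserved_orders)         # ['A', 'B', 'C']
```

The point is that the log, not the consumer, owns the history: because the consumer only remembers where it stopped, being down for 5 minutes just means it has more to read when it comes back.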

Number 3, to me, is a state machine. How you implement the state machine is really up to you. Basically, it keeps track of which events have happened and transitions between allowed states (potentially emitting events/commands of its own) based on the events in the other systems.

For example, a real-world use case might look like an "escrow" when you're trying to close on a house. The escrow company doesn't just handle the financial transaction; usually they also work with the real-estate agent to coordinate getting papers in order, papers signed, money transferred, etc. After each event, the escrow changes state, from "waiting for buyer signature" to "waiting for seller signature" to "waiting for funds" to "closed success". They even have deadlines for these events to happen, and can transition to another state, something like "transaction closed, not finished", if the money doesn't get transferred.

In your example, this state machine would listen on the pub/sub channels, capture this state, run timers, emit other events to move the involved systems forward, etc. It doesn't necessarily "orchestrate" them per se, but it does track the progress and enforce timeouts and compensations where needed. This could be implemented as a stream processor, as a process engine, or (imho the best place to start) just a simple "escrow" service.
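As a rough sketch of what such an "escrow" tracker could look like for the ordering example, assuming one tracker per order: the state names, the `OrderTracker` class, the 300-second step timeout, and the `OrderStalledEvent` it emits are all hypothetical, invented here for illustration. Time is passed in explicitly as `now` so the logic can be followed without real timers.

```python
from dataclasses import dataclass

# Allowed transitions: (current state, event type) -> next state.
TRANSITIONS = {
    ("AWAITING_INVENTORY", "InventoryUpdatedEvent"): "AWAITING_INVOICE",
    ("AWAITING_INVOICE", "EmailInvoiceEvent"): "COMPLETED",
}


@dataclass
class OrderTracker:
    order_id: str
    deadline: float                 # seconds; when the current step times out
    state: str = "AWAITING_INVENTORY"

    def on_event(self, event_type: str, now: float, step_timeout: float = 300.0) -> None:
        next_state = TRANSITIONS.get((self.state, event_type))
        if next_state is None:
            return                  # duplicate or unexpected event: ignored here
        self.state = next_state
        self.deadline = now + step_timeout

    def check_timeout(self, now: float) -> list:
        """Return follow-up events to emit if the current step is overdue."""
        if self.state in ("COMPLETED", "TIMED_OUT") or now < self.deadline:
            return []
        stalled_in = self.state
        self.state = "TIMED_OUT"
        return [{"type": "OrderStalledEvent", "order_id": self.order_id,
                 "stalled_in": stalled_in}]


# Order A: the Inventory service is down, so nothing arrives within 5 minutes.
stalled = OrderTracker("A", deadline=300.0)      # tracker created at t=0
early_check = stalled.check_timeout(now=100.0)   # still within the deadline
stalled_events = stalled.check_timeout(now=301.0)
print(early_check)      # []
print(stalled.state)    # TIMED_OUT
print(stalled_events)   # [{'type': 'OrderStalledEvent', 'order_id': 'A', 'stalled_in': 'AWAITING_INVENTORY'}]

# Order B: every step arrives in time; a duplicate in between changes nothing.
healthy = OrderTracker("B", deadline=300.0)
healthy.on_event("InventoryUpdatedEvent", now=10.0)
healthy.on_event("InventoryUpdatedEvent", now=11.0)   # duplicate, ignored
healthy.on_event("EmailInvoiceEvent", now=20.0)
print(healthy.state)    # COMPLETED
```

What happens on `OrderStalledEvent` (retry, alert, compensate) is a separate decision; the tracker only notices that the order has stopped making progress.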

There's actually more to keep track of, like what happens if the "escrow" service itself goes down/fails, how it handles duplicates, how it handles unexpected events given its state, how it contributes to duplicate events, etc., but hopefully this is enough to get started.
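One possible way to approach a couple of those open points, sketched under the assumption that the events are kept in a replayable log and that each event carries a unique id: after a crash, the "escrow" service rebuilds its view by folding over the replayed events, skipping exact redeliveries and flagging events that make no sense for the order's current state. The `rebuild` function, the `NEW` state, and the event fields are hypothetical.

```python
TRANSITIONS = {
    ("NEW", "CreateOrderEvent"): "AWAITING_INVENTORY",
    ("AWAITING_INVENTORY", "InventoryUpdatedEvent"): "AWAITING_INVOICE",
    ("AWAITING_INVOICE", "EmailInvoiceEvent"): "COMPLETED",
}


def rebuild(events):
    """Fold a replayed event stream into per-order state.

    Returns (states, anomalies): states maps order_id -> state, and anomalies
    lists events that were not valid for the order's state at that point.
    """
    states, seen_ids, anomalies = {}, set(), []
    for event in events:
        if event["id"] in seen_ids:
            continue                      # exact redelivery: safe to skip
        seen_ids.add(event["id"])
        current = states.get(event["order_id"], "NEW")
        next_state = TRANSITIONS.get((current, event["type"]))
        if next_state is None:
            anomalies.append(event)       # unexpected for this state: flag it
        else:
            states[event["order_id"]] = next_state
    return states, anomalies


replayed = [
    {"id": "e1", "order_id": "A", "type": "CreateOrderEvent"},
    {"id": "e2", "order_id": "A", "type": "InventoryUpdatedEvent"},
    {"id": "e2", "order_id": "A", "type": "InventoryUpdatedEvent"},  # duplicate
    {"id": "e3", "order_id": "B", "type": "EmailInvoiceEvent"},      # no order yet
]
states, anomalies = rebuild(replayed)
print(states)     # {'A': 'AWAITING_INVOICE'}
print(anomalies)  # [{'id': 'e3', 'order_id': 'B', 'type': 'EmailInvoiceEvent'}]
```

Because the fold is deterministic, replaying the same log after every restart gives the same state, which is what makes it safe for the tracker itself to fail.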

answered Oct 07 '22 08:10

ceposta
