
How to restore state in an event based, message driven microservice architecture on failure scenario

In the context of a microservice architecture, a message driven, asynchronous, event based design seems to be gaining popularity (see here and here for some examples, as well as the Reactive Manifesto - Message Driven trait) as opposed to a synchronous (possibly REST based) mechanism.

Taking that context and imagining an overly simplified ordering system, as depicted below:

[Figure: ordering system]

and the following message flow:

  • Order is placed from some source (web/mobile etc.)
  • Order service accepts order and publishes a CreateOrderEvent
  • The Inventory service reacts to the CreateOrderEvent, does some inventory stuff and publishes an InventoryUpdatedEvent when it's done
  • The Invoice service then reacts to the InventoryUpdatedEvent, sends an invoice and publishes an EmailInvoiceEvent

All services are up and we happily process orders... Everyone is happy. Then, the Inventory service goes down for some reason 😬

Assume that the events on the event bus flow in a "non-blocking" manner, i.e. the messages are published to a central topic and do not pile up on a queue if no service is reading from it. (What I'm trying to convey is an event bus where, once an event is published on the bus, it flows "straight through" and does not queue up. Ignore which messaging platform/technology is used at this point.) That would mean that if the Inventory service were down for 5 minutes, the CreateOrderEvents passing through the event bus during that time are now "gone", or at least never seen by the Inventory service, because in our overly simplified system no other system is interested in those events.

My question then is: How does the Inventory service (and the system as a whole) restore state in a way that no orders are missed/not processed?

Donovan Muller asked Jun 10 '16 13:06

1 Answer

Good question! So there are basically three forces at play here:

  1. if a service goes down, any of the events it may have missed need to be replayed to keep it consistent
  2. the events, as they happen in "time", have a "this happened before that" ordering to them
  3. there may be (but doesn't have to be) another party interested in overseeing a cloud of events to make sure a certain state is achieved.

For both #1 and #2 you want some sort of persistent log of events. A traditional message queue/topic may provide this, though you have to consider the cases where messages may be processed out of order with respect to transactions, exceptions, or fault behaviors. A simpler log like Apache BookKeeper, Apache Kafka, AWS Kinesis, etc. can store/persist these types of events in sequence and leave it to the consumers to process them in order, filter out duplicates, partition streams, etc.
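To make the replay idea concrete, here is a minimal in-memory sketch. It is not the API of Kafka, Kinesis, or BookKeeper; `EventLog`, `InventoryConsumer`, and the `order_id` field are hypothetical names made up for illustration. The point is only that the log keeps events in order and the consumer remembers the next offset it needs, so events published while it was down are still there when it comes back.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only, in-memory stand-in for a durable log (hypothetical, not a real broker API)."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> list:
        # Return (offset, event) pairs starting at the given offset, in log order.
        return list(enumerate(self.events))[offset:]


@dataclass
class InventoryConsumer:
    """Tracks the next offset to read; a real service would persist this durably."""
    log: EventLog
    next_offset: int = 0
    processed: list = field(default_factory=list)

    def poll(self) -> None:
        for offset, event in self.log.read_from(self.next_offset):
            self.processed.append(event["order_id"])
            self.next_offset = offset + 1  # advance only after the event is processed


log = EventLog()
consumer = InventoryConsumer(log)

log.append({"type": "CreateOrderEvent", "order_id": 1})
consumer.poll()

# Inventory service is "down": events keep arriving but nobody polls.
log.append({"type": "CreateOrderEvent", "order_id": 2})
log.append({"type": "CreateOrderEvent", "order_id": 3})

# Service comes back and resumes from its last recorded offset.
consumer.poll()
print(consumer.processed)  # [1, 2, 3]
```

Because the offset advances only after processing, a crash between the two steps would cause a re-delivery rather than a loss, which is why the consumer also has to tolerate duplicates.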

Number 3, to me, is a state machine. How you implement the state machine is really up to you. Basically, this state machine keeps track of what events have happened and transitions to allowed states (and potentially participates in emitting events/commands) based on the events in the other systems.

For example, a real-world use case might look like an "escrow" when you're trying to close on a house. The escrow company not only handles the financial transaction, but usually also works with the real-estate agent to coordinate getting papers in order, papers signed, money transferred, etc. After each event, the escrow changes state from "waiting for buyer signature" to "waiting for seller signature" to "waiting for funds" to "closed success". They even have deadlines for these events to happen, and can transition to another state, like "transaction closed, not finished" or something, if the money doesn't get transferred.

This state machine in your example would listen on the pub/sub channels, capture this state, run timers, emit other events to further the systems involved, etc. It doesn't necessarily "orchestrate" them per se, but it does track the progress and enforce timeouts and compensations where needed. This could be implemented as a stream processor, as a process engine, or (imho the best place to start) just a simple "escrow" service.
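A rough sketch of what such an "escrow" tracker could look like for the ordering example follows. The event names come from the question; the state names, the `OrderTracker` class, the `OrderStalledEvent` it emits, and the 300-second timeout are all assumptions made for illustration, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass

# Allowed transitions: (current state, event type) -> next state.
TRANSITIONS = {
    ("awaiting_inventory", "InventoryUpdatedEvent"): "awaiting_invoice",
    ("awaiting_invoice", "EmailInvoiceEvent"): "completed",
}


@dataclass
class OrderTracker:
    """Created when a CreateOrderEvent is seen; tracks one order's progress."""
    order_id: int
    deadline: float  # time by which the next expected event must arrive
    state: str = "awaiting_inventory"

    def on_event(self, event_type: str, now: float, timeout_s: float = 300.0) -> None:
        next_state = TRANSITIONS.get((self.state, event_type))
        if next_state is None:
            return  # event not expected in this state: ignore it
        self.state = next_state
        self.deadline = now + timeout_s

    def check_timeout(self, now: float) -> list:
        # Returns events to emit; empty if the order is finished or still on time.
        if self.state in ("completed", "timed_out") or now < self.deadline:
            return []
        stalled_in, self.state = self.state, "timed_out"
        return [{"type": "OrderStalledEvent", "order_id": self.order_id,
                 "stalled_in": stalled_in}]


happy = OrderTracker(order_id=1, deadline=300.0)
happy.on_event("InventoryUpdatedEvent", now=10.0)
happy.on_event("EmailInvoiceEvent", now=20.0)

stuck = OrderTracker(order_id=2, deadline=300.0)  # Inventory service never responds
alerts = stuck.check_timeout(now=301.0)
print(happy.state, stuck.state, alerts)
```

What happens in response to the stalled event (re-publishing the original order event, alerting someone, cancelling the order) is a business decision; the tracker only makes the gap visible.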

There's actually more to keep track of, like what happens if an "escrow" service goes down/fails, how it handles duplicates, how it handles unexpected events given its state, how it contributes to duplicate events, etc... but hopefully that's enough to get started.
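On the duplicates point, one common approach is to make each consumer idempotent by remembering which event ids it has already handled. A minimal sketch, assuming every event carries a unique `event_id` (that field and the `IdempotentHandler` class are hypothetical, and the in-memory set stands in for durable storage):

```python
class IdempotentHandler:
    """Wraps a handler so that re-delivered events are processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a real service would keep this in durable storage

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate: skip it
        self._handler(event)
        self._seen.add(event_id)  # record only after the handler succeeded
        return True


reserved = []
handler = IdempotentHandler(lambda e: reserved.append(e["order_id"]))

results = [
    handler.handle({"event_id": "a1", "order_id": 1}),
    handler.handle({"event_id": "a1", "order_id": 1}),  # replayed after a restart
    handler.handle({"event_id": "b2", "order_id": 2}),
]
print(results, reserved)  # [True, False, True] [1, 2]
```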

ceposta answered Oct 07 '22 08:10
