We have 7 microservices communicating via an event bus. We have a real-time transaction sequence: Service 1 -> Service 2 -> Service 3 (and so on). A transaction is only considered complete once every step in the sequence has happened, and of course failures can occur at any point. So we are thinking about a mechanism to replay "half-baked" transactions to completion. It's getting tricky. Two approaches we thought about: <ol> <li> Having another service (a supervisor service) that logs each part of our real-time sequence and is smart enough, when a transaction is not completed (timed out), to understand how to continue from the point where it left off. Disadvantage: lots of "smart" logic in one central service. </li> <li> Having a retry mechanism in every service, with each one taking care of its own work and replaying it until success or until retries are exhausted. Disadvantage: lots of duplicated retry code in each service. </li> </ol> What do you experts think? Thanks


How to create replay mechanism within event-drive microservice

1 Answer

What you seem to be talking about is how to deal with transactions in a distributed architecture.

This is an extensive topic and entire books could be written about this. Your question seems to be just about retrying the transactions, but I believe that alone is probably not enough to solve the problem of distributed transactional workflow.

I believe you could probably benefit from gaining more understanding of concepts like:

Compensating Transactions Pattern
Try/Cancel/Confirm Pattern
Long Running Transactions
Sagas

The idea behind compensating transactions is that every yin has its yang: if you have one transaction that places an order, then you can undo it with a transaction that cancels that order. This latter transaction is a compensating transaction. So, if you carry out a number of successful transactions and then one of them fails, you can retrace your steps, compensate every transaction that succeeded and, as a result, revert their side effects.
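To make the idea concrete, here is a minimal sketch of running a list of steps and compensating the successful ones in reverse order when a later step fails. The names CompensableStep and CompensatingWorkflow are my own, not from any library, and I am assuming a step signals failure by throwing a RuntimeException.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** A workflow step paired with the action that undoes it (hypothetical interface). */
interface CompensableStep {
    void execute();     // applies the side effect; throws RuntimeException on failure
    void compensate();  // reverts the side effect of a successful execute()

    /** Convenience factory so a step can be written as two lambdas. */
    static CompensableStep of(Runnable action, Runnable undo) {
        return new CompensableStep() {
            @Override public void execute() { action.run(); }
            @Override public void compensate() { undo.run(); }
        };
    }
}

final class CompensatingWorkflow {
    private CompensatingWorkflow() {}

    /**
     * Runs the steps in order. If one fails, compensates the steps that
     * already succeeded, most recent first, and returns false.
     */
    static boolean run(List<CompensableStep> steps) {
        Deque<CompensableStep> completed = new ArrayDeque<>();
        for (CompensableStep step : steps) {
            try {
                step.execute();
                completed.push(step); // remember it so it can be undone later
            } catch (RuntimeException failure) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate(); // last successful step first
                }
                return false;
            }
        }
        return true;
    }
}
```

In a real system a compensation can fail too, so the compensating actions themselves would need to be idempotent and retried; the sketch leaves that out.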

I particularly liked a chapter in the book REST: From Research to Practice. Chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) explains the Try/Cancel/Confirm pattern in depth.

In general terms, it implies that when you do a group of transactions, their side effects do not take effect until a transaction coordinator gets confirmation that they were all successful. For example, if you make a reservation in Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another one would reserve a flight with United Airlines. If your second reservation fails, then you want to compensate the first one. But not only that: you also want to keep the first reservation from taking effect until you have been able to confirm both. So the initial transaction makes the reservation but keeps its side effects pending confirmation, and the second reservation does the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties so that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
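A rough sketch of that flow, under my own assumptions, could look like the following. TccParticipant, TccCoordinator and FakeAirline are hypothetical names for illustration only, and the automatic expiry of unconfirmed reservations is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** A participant that can hold a reservation as pending (hypothetical interface). */
interface TccParticipant {
    boolean tryReserve(String bookingId); // applies side effects as "pending"
    void confirm(String bookingId);       // makes the pending reservation effective
    void cancel(String bookingId);        // releases the pending reservation
}

final class TccCoordinator {
    private TccCoordinator() {}

    /** Returns true only if every participant reserved and was then confirmed. */
    static boolean book(String bookingId, List<TccParticipant> participants) {
        List<TccParticipant> reserved = new ArrayList<>();
        for (TccParticipant participant : participants) {
            if (participant.tryReserve(bookingId)) {
                reserved.add(participant);
            } else {
                // One leg failed: cancel the legs that are still pending.
                for (TccParticipant pending : reserved) {
                    pending.cancel(bookingId);
                }
                return false;
            }
        }
        // Everything is reserved, so the coordinator confirms all parties.
        for (TccParticipant participant : reserved) {
            participant.confirm(bookingId);
        }
        return true;
    }
}

/** In-memory stand-in for an airline, used only to exercise the coordinator. */
final class FakeAirline implements TccParticipant {
    private final boolean hasSeats;
    String status = "NONE";

    FakeAirline(boolean hasSeats) { this.hasSeats = hasSeats; }

    @Override public boolean tryReserve(String bookingId) {
        if (hasSeats) status = "PENDING";
        return hasSeats;
    }
    @Override public void confirm(String bookingId) { status = "CONFIRMED"; }
    @Override public void cancel(String bookingId) { status = "CANCELLED"; }
}
```

With two FakeAirline instances where the second has no seats, book returns false and the first airline ends up CANCELLED rather than CONFIRMED, which is the behavior described for the two-leg flight.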

The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see the Process Manager pattern and compare it with the Routing Slip pattern; the two are similar in spirit to orchestration vs. choreography in the microservices world).

As you can see, being able to compensate transactions might be complicated depending on how complex your distributed workflow is. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much the idea of Sagas in the microservices world.
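Since the question is about replaying half-finished transactions, here is a small sketch of a process manager that records how far each transaction got and resumes from that point. ResumableProcessManager is a made-up name, the progress map lives in memory only (a real one would need durable storage), and I am assuming each step is idempotent, because a step could succeed right before a crash without its completion being recorded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Tracks per-transaction progress through a fixed sequence of steps (hypothetical). */
final class ResumableProcessManager {
    // Each step receives the transaction id and throws RuntimeException on failure.
    private final List<Consumer<String>> steps;
    // Number of steps already completed, keyed by transaction id.
    private final Map<String, Integer> completedSteps = new HashMap<>();

    ResumableProcessManager(List<Consumer<String>> steps) {
        this.steps = steps;
    }

    /**
     * Runs the steps that have not completed yet for this transaction.
     * Returns true when the whole sequence is done, false if a step failed;
     * in that case progress is kept and a later call resumes at the failed step.
     */
    boolean runOrResume(String transactionId) {
        int next = completedSteps.getOrDefault(transactionId, 0);
        while (next < steps.size()) {
            try {
                steps.get(next).accept(transactionId);
            } catch (RuntimeException failure) {
                return false;
            }
            next++;
            completedSteps.put(transactionId, next); // record progress after each step
        }
        return true;
    }
}
```

A timeout-driven supervisor could then simply call runOrResume again for every transaction that has not finished, without re-running the steps that already succeeded.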

The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that goes into detail on how to implement this type of solution.

A few other aspects I also typically consider are the following:

Idempotency

I believe that a key to a successful implementation of your service transactions in a distributed system consists in making them idempotent. Once you can guarantee a given service is idempotent, then you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve your problems.
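One common way to get there is to have each event carry a unique id and have the consumer remember which ids it has already applied. The sketch below assumes exactly that; IdempotentPaymentHandler and its fields are invented for illustration, and in a real service the id and the side effect would have to be stored atomically (for example in the same database transaction).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Applies each charge at most once per message id (hypothetical handler). */
final class IdempotentPaymentHandler {
    private final Set<String> processedMessageIds = new HashSet<>();
    private final Map<String, Integer> chargedCentsByAccount = new HashMap<>();

    /**
     * Returns true if the charge was applied, false if this message id
     * was already handled (a redelivery or a retry).
     */
    synchronized boolean handle(String messageId, String account, int amountCents) {
        if (!processedMessageIds.add(messageId)) {
            return false; // already processed: do not charge again
        }
        chargedCentsByAccount.merge(account, amountCents, Integer::sum);
        return true;
    }

    /** Total charged to the account so far, in cents. */
    synchronized int chargedCents(String account) {
        return chargedCentsByAccount.getOrDefault(account, 0);
    }
}
```

Delivering the same message twice leaves the account charged once, which is what makes a blind replay of that step safe.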

Transient vs Persistent Errors

When it comes to retrying a service transaction, you shouldn't just retry because it failed. You must first know why it failed; depending on the error, it may or may not make sense to retry. Some types of errors are transient. For example, if a transaction fails due to a query timeout, it's probably fine to retry, and most likely it will succeed the second time. But if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try, it will fail.
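As a sketch of that distinction, the helper below retries only failures that were explicitly classified as transient and lets everything else propagate on the first attempt. TransientFailureException and SelectiveRetry are hypothetical names, and how you map real errors (timeouts, constraint violations) onto them is up to each service. There is no delay or backoff between attempts here, which a real implementation would probably want.

```java
import java.util.function.Supplier;

/** Marks a failure that is worth retrying, e.g. a query timeout (hypothetical type). */
class TransientFailureException extends RuntimeException {
    TransientFailureException(String message) {
        super(message);
    }
}

final class SelectiveRetry {
    private SelectiveRetry() {}

    /**
     * Calls the action up to maxAttempts times, retrying only transient failures.
     * Any other exception is treated as persistent and propagates immediately.
     */
    static <T> T call(Supplier<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientFailureException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (TransientFailureException e) {
                lastFailure = e; // transient: worth another attempt
            }
        }
        throw lastFailure; // attempts exhausted
    }
}
```

Keeping this in one small shared helper is also one way to avoid duplicating retry code in every service.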

Embrace Error as an Alternative Flow

In those cases of inter-service communication (computer-to-computer interactions), when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can embrace error as part of your workflow. Catalog the possible causes of failure and make them an alternative flow of events that merely requires human intervention. It is just another step in the full orchestration, one that requires a person to intervene to make a decision, resolve an inconsistency in the data or just approve which way to go.

For example, maybe when you're processing an order, the payment service fails because there aren't enough funds. There is no point in undoing everything else. All you need is to put the order in a state in which someone can address the problem in the system and, once it is fixed, continue with the rest of the workflow.
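One way to picture this alternative flow is a place where failed orders are parked together with the reason, until a person resolves them and the workflow resumes. The sketch below is only an illustration: InterventionQueue is a made-up name, the storage is an in-memory map, and the resume callback stands in for whatever re-publishes the event in your system.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Holds orders that need a human decision before the workflow continues (hypothetical). */
final class InterventionQueue {
    // Order id mapped to the reason it was parked, in arrival order.
    private final Map<String, String> parked = new LinkedHashMap<>();
    // Called with the order id once a person has resolved the problem.
    private final Consumer<String> resumeWorkflow;

    InterventionQueue(Consumer<String> resumeWorkflow) {
        this.resumeWorkflow = resumeWorkflow;
    }

    /** Parks the order instead of undoing the steps that already succeeded. */
    void park(String orderId, String reason) {
        parked.put(orderId, reason);
    }

    /** Returns the reason the order is waiting, or null if it is not parked. */
    String reasonFor(String orderId) {
        return parked.get(orderId);
    }

    /** Marks the problem as fixed and resumes the workflow; false if the order was not parked. */
    boolean resolve(String orderId) {
        if (parked.remove(orderId) == null) {
            return false;
        }
        resumeWorkflow.accept(orderId);
        return true;
    }
}
```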

Transaction and Data Model State are Key

I have discovered that this type of transactional workflow requires a good design of the different states your model has to go through. As in the case of the Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.

For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payments have been confirmed the order will then appear in the UI such that a user can finally process its shipments.

The difficulty here is discovering how to design transaction granularity in a way that even if one step of your transaction workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.
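To illustrate what I mean by designing the states, here is a small sketch of an order modeled as an explicit state machine. Only the "Pending" status comes from the example above; the other states, the allowed transitions and the names OrderStatus and WorkflowOrder are assumptions made for the sake of the example.

```java
import java.util.EnumSet;
import java.util.Set;

/** Possible states of an order and the transitions allowed between them (illustrative). */
enum OrderStatus {
    PENDING, PAYMENT_FAILED, CONFIRMED, CANCELLED;

    /** States that may directly follow this one. */
    Set<OrderStatus> nextStates() {
        switch (this) {
            case PENDING:
                return EnumSet.of(CONFIRMED, PAYMENT_FAILED, CANCELLED);
            case PAYMENT_FAILED:
                // Once someone fixes the problem, the order can go back to PENDING.
                return EnumSet.of(PENDING, CANCELLED);
            default:
                return EnumSet.noneOf(OrderStatus.class); // terminal states
        }
    }

    /** Only confirmed orders show up in the warehouse UI. */
    boolean visibleToWarehouse() {
        return this == CONFIRMED;
    }
}

/** An order that refuses transitions the state machine does not allow. */
final class WorkflowOrder {
    private OrderStatus status = OrderStatus.PENDING;

    OrderStatus status() {
        return status;
    }

    void transitionTo(OrderStatus target) {
        if (!status.nextStates().contains(target)) {
            throw new IllegalStateException("Cannot go from " + status + " to " + target);
        }
        status = target;
    }
}
```

Every state here is one the system can legitimately rest in, so a failed step leaves the order somewhere it can be resumed from rather than in limbo.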

Designing for Distributed Transactional Workflows

So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Now every service invocation may fail for a number of reasons and leave your distributed workflow in an inconsistent state. And retrying the transaction may not always solve the problem. And your data needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.

That's why the whole thing may need to be designed differently from how you would typically design a monolithic client-server application. Your users may now be part of the designed solution when it comes to resolving conflicts, and you have to contemplate that transactional orchestrations could potentially take hours or even days to complete, depending on how their conflicts are resolved.

As I was initially saying, the topic is way too broad, and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.

At any rate, I hope this somehow helped you with your investigation.

answered Nov 01 '22 09:11

Edwin Dalorzo


Tags:

java

architecture

transactions

microservices

event-driven-design

rayman
