I just read the Hystrix guide and am trying to wrap my head around how the default circuit breaker and recovery period operate, and then how to customize their behavior.
Obviously, if the circuit is tripped, Hystrix will automatically call the command's getFallback() method; this much I understand. But what criteria go into tripping the circuit in the first place? Ideally, I'd like to try hitting a backing service several times (say, a max of 3 attempts) before we consider the service to be offline/unhealthy and trip the circuit breaker. How could I implement this, and where?
But I imagine that if I override the default circuit breaker, I must also override whatever mechanism handles the default recovery period. If a backing service goes down, it could be for any one of several reasons:
In most of these cases, it is not sufficient to have a recovery period that merely waits N seconds and then tries again. If the service has a bug in it, or if someone pulled some network cables in the data center, we will always get failures from this service. Only in a small number of cases will the client-service automagically heal itself without any human interaction.
So I guess my next question is partially "How do I customize the default recovery period strategy?", but I guess it is mainly: "How do I use Hystrix to notify devops when a service is down and requires manual intervention?"
The circuit breaker calculates when to open and close the circuit and what to do in case of a failure. Hystrix watches for failures in the protected (annotated) methods and opens the circuit so that subsequent calls fail automatically; that is, while the circuit is open, Hystrix redirects calls to the fallback method.
Hystrix helps by providing protection and control over latency and failure from dependencies, most commonly those accessed over network. It helps stop cascading failures and allows you to fail fast and rapidly recover, or fallback and gracefully degrade. Here's how it works.
Having an open circuit stops cascading failures and allows overwhelmed or failing services time to recover. The fallback can be another Hystrix protected call, static data, or a sensible empty value.
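To make the open/close decision concrete, here is a minimal toy model in plain Java. It is a sketch, not Hystrix's implementation: the class `ToyCircuitBreaker` and its methods are made up for illustration, and it uses simple counters where Hystrix uses a rolling statistics window. The three parameters mirror the Hystrix properties `circuitBreaker.requestVolumeThreshold` (default 20), `circuitBreaker.errorThresholdPercentage` (default 50) and `circuitBreaker.sleepWindowInMilliseconds` (default 5000).

```java
import java.util.function.LongSupplier;

/** Toy circuit breaker for illustration only; NOT Hystrix's implementation. */
class ToyCircuitBreaker {
    private final int requestVolumeThreshold;   // minimum calls before the error rate is evaluated
    private final int errorThresholdPercentage; // error rate (%) at which the circuit opens
    private final long sleepWindowMillis;       // time the circuit stays open before a trial call
    private final LongSupplier clock;           // injectable time source in milliseconds

    private int total;
    private int failures;
    private boolean open;
    private long openedAt;

    ToyCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                      long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /**
     * True if a call may go through: the circuit is closed, or it is open
     * but the sleep window has elapsed (trial call). Simplification: every
     * caller is let through until a trial result is recorded.
     */
    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void recordSuccess() {
        if (open) {
            // Trial call succeeded: close the circuit and reset the counters.
            open = false;
            total = 0;
            failures = 0;
            return;
        }
        total++;
    }

    synchronized void recordFailure() {
        if (open) {
            // Trial call failed: stay open and restart the sleep window.
            openedAt = clock.getAsLong();
            return;
        }
        total++;
        failures++;
        // Open only when enough calls were seen AND the error rate is too high.
        if (total >= requestVolumeThreshold
                && failures * 100 >= errorThresholdPercentage * total) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

With the defaults quoted above, three failed calls alone would not trip the circuit, because fewer than 20 requests were seen. Lowering the volume threshold is the knob that gets closer to "trip after a few failures".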
Hystrix is a library that controls the interaction between microservices to provide latency and fault tolerance. Additionally, it makes sense to modify the UI to let the user know that something might not have worked as expected or would take more time.
There are basically four reasons for Hystrix to call the fallback method: an exception, a timeout, too many parallel requests, or too many exceptions in the previous calls.
You might want to do a retry in your run() method if the return code or the exception you receive from your service indicates that a retry makes sense.
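A retry like this could be sketched as a small helper that a command's run() method delegates to. This is plain Java with hypothetical names (`Retry`, `call`, and the `retryable` predicate are not part of Hystrix), and it assumes the caller can tell from the exception whether another attempt makes sense.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical retry helper; not part of Hystrix. */
class Retry {
    /**
     * Calls the action up to maxAttempts times. Stops early on success or
     * when the predicate says the exception is not worth retrying.
     * Rethrows the last exception if no attempt succeeded.
     */
    static <T> T call(Callable<T> action, int maxAttempts,
                      Predicate<Exception> retryable) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (!retryable.test(e)) {
                    break; // e.g. a client error: retrying would not help
                }
            }
        }
        throw last;
    }
}
```

Inside run() this might be used as `return Retry.call(() -> client.fetch(), 3, e -> e instanceof java.io.IOException);`, where `client.fetch()` stands for whatever call reaches the backing service. Keep in mind that all attempts together have to fit into the command's timeout, and that an exception escaping run() is what Hystrix sees as the failure of that command.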
In the command's fallback method you might retry after a timeout. When there were too many parallel requests or too many exceptions, it usually makes no sense to call the same service again.
As you also asked how to notify devops: You should connect a monitoring system to Hystrix that polls the status of the circuit breaker and the ratio of successful and unsuccessful calls. You can use the metrics publishers provided, JMX, or write your own adapter using Hystrix's API. I've written two adapters for Riemann and Zabbix in a tutorial I prepared; you'll need very few lines of code for that.
The tutorial also has an example application and a load driver to try out some scenarios.
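As a rough idea of what such a polling adapter does, here is a minimal sketch in plain Java. All names are hypothetical (`CircuitWatcher`, `poll`, the alert sink), and it is not one of the adapters from the tutorial. It assumes the circuit state is available as a `BooleanSupplier`, which in a real setup could wrap the `isOpen()` method of Hystrix's `HystrixCircuitBreaker`. It reports state changes and escalates once when the circuit stays open longer than a chosen period, which is the case where a human probably has to look.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical poller that turns circuit state into alerts; not a Hystrix class. */
class CircuitWatcher {
    private final String name;
    private final BooleanSupplier circuitOpen; // assumption: reads the breaker state
    private final long escalateAfterMillis;    // how long "open" is tolerated before escalating
    private final Consumer<String> alertSink;  // e.g. sends to a monitoring system
    private final LongSupplier clock;          // injectable time source in milliseconds

    private long openSince = -1;               // -1 means the circuit was closed at the last poll
    private boolean escalated;

    CircuitWatcher(String name, BooleanSupplier circuitOpen, long escalateAfterMillis,
                   Consumer<String> alertSink, LongSupplier clock) {
        this.name = name;
        this.circuitOpen = circuitOpen;
        this.escalateAfterMillis = escalateAfterMillis;
        this.alertSink = alertSink;
        this.clock = clock;
    }

    /** Intended to be called periodically, e.g. from a scheduled executor. */
    synchronized void poll() {
        long now = clock.getAsLong();
        if (!circuitOpen.getAsBoolean()) {
            if (openSince >= 0) {
                alertSink.accept(name + ": circuit closed again");
            }
            openSince = -1;
            escalated = false;
            return;
        }
        if (openSince < 0) {
            openSince = now;
            alertSink.accept(name + ": circuit opened");
        } else if (!escalated && now - openSince >= escalateAfterMillis) {
            escalated = true; // escalate only once per outage
            alertSink.accept(name + ": circuit still open, manual intervention may be needed");
        }
    }
}
```

The design choice here is to alert on transitions rather than on every poll, so a long outage produces three messages (opened, escalated, closed) instead of a flood.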
Br, Alexander.