Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Erlang's let-it-crash philosophy - applicable elsewhere?

Erlang's (or Joe Armstrong's?) advice NOT to use defensive programming and to let processes crash (rather than pollute your code with needless guards trying to keep track of the wreckage) makes so much sense to me now that I wonder why I wasted so much effort on error handling over the years!

What I wonder is - is this approach only applicable to platforms like Erlang? Erlang has a VM with simple native support for process supervision trees and restarting processes is really fast. Should I spend my development efforts (when not in the Erlang world) on recreating supervision trees rather than bogging myself down with top-level exception handlers, error codes, null results etc etc etc.

Do you think this change of approach would work well in (say) the .NET or Java space?

like image 273
Andrew Matthews Avatar asked Dec 08 '10 22:12

Andrew Matthews


3 Answers

It's applicable everywhere. Whether or not you write your software in a "let it crash" pattern, it will crash anyway, e.g., when hardware fails. "Let it crash" applies anywhere where you need to withstand reality. Quoth James Hamilton:

If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed.

This doesn't precisely mean "never use guards," though. But don't be afraid to crash!

like image 131
Craig Stuntz Avatar answered Oct 23 '22 06:10

Craig Stuntz


Yes, it is applicable everywhere, but it is important to note in which context it is meant to be used. It does not mean that the application as a whole crashes which, as @PeterM pointed out, can be catastrophic in many cases. The goal is to build a system which as a whole never crashes but can handle errors internally. In our case it was telecomms systems which are expected to have downtimes in the order of minutes per year.

The basic design is to layer the system and isolate central parts of the system to monitor and control the other parts which do the work. In OTP terminology we have supervisor and worker processes. Supervisors have the job of monitoring the workers, and other supervisors, with the goal of restarting them in the correct way when they crash while the workers do all the actual work. Structuring the system properly in layers using this principle of strictly separating the functionality allows you to isolate most of the error handling out of the workers into the supervisors. You try to end up with a small fail-safe error kernel, which if correct can handle errors anywhere in the rest of the system. It is in this context where the "let-it-crash" philosophy is meant to be used.

You get the paradox of where you are thinking about errors and failures everywhere with the goal of actually handling them in as few places as possible.

The best approach to handle an error depends of course on the error and the system. Sometimes it is best to try and catch errors locally within a process and trying to handle them there, with the option of failing again if that doesn't work. If you have a number of worker processes cooperating then it is often best to crash them all and restart them again. It is a supervisor which does this.

You do need a language which generates errors/exceptions when something goes wrong so you can trap them or have them crash the process. Just ignoring error return values is not the same thing.

like image 28
rvirding Avatar answered Oct 23 '22 05:10

rvirding


It is called fail-fast. It's a good paradigm provided you have a team of people who can respond to the failure (and do so quickly).

In the NAVY all pipes and electrical is mounted on the exterior of a wall (preferably on the more public side of a wall). That way, if there is a leak or issue, it is more likely to be detected quickly. In the NAVY, people are punished for not responding to a failure, so it works very well: failures are detected quickly and acted upon quickly.

In a scenario where someone cannot act on a failure quickly, it becomes a matter of opinion whether it is more beneficial to allow the failure to stop the system or to swallow the failure and attempt to continue onward.

like image 6
Edwin Buck Avatar answered Oct 23 '22 04:10

Edwin Buck