
How do you design the architecture of an Erlang/OTP-based distributed fault-tolerant multicore system?

I would like to build an Erlang/OTP-based system which solves an 'embarrassingly parallel' problem.

I have already read/skimmed through:

  • Learn You Some Erlang;
  • Programming Erlang (Armstrong);
  • Erlang Programming (Cesarini);
  • Erlang/OTP in Action.

I have got the gist of Processes, Messaging, Supervisors, gen_servers, Logging, etc.

I do understand that certain architecture choices depend on the application concerned, but I would still like to know some general principles of Erlang/OTP system design.

Should I just start with a few gen_servers with a supervisor and incrementally build on that?

How many supervisors should I have? How do I decide which parts of the system should be process-based? How should I avoid bottlenecks?

Should I add logging later?

What is the general approach to the architecture of distributed, fault-tolerant, multiprocessor Erlang/OTP systems?

skanatek asked Sep 05 '11 11:09



1 Answer

Should I just start with a few gen_servers with a supervisor and incrementally build on that?

You're missing one key component in Erlang architectures here: applications! (That is, the concept of OTP applications, not software applications).

Think of applications as components. A component in your system solves a particular problem, is responsible for a coherent set of resources, or abstracts something important or complex from the rest of the system.

The first step when designing an Erlang system is to decide which applications are needed. Some can be pulled from the web as they are; these we can refer to as libraries. Others you'll need to write yourself (otherwise you wouldn't need this particular system). These applications we usually refer to as the business logic. Often you need to write some libraries yourself as well, but it is useful to keep the distinction between the libraries and the core business applications that tie everything together.
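To make the notion of an OTP application a bit more concrete, here is a minimal sketch of an application callback module. The names demo_app and demo_app_sup are hypothetical, and for brevity the same module also acts as the (empty) top-level supervisor; in a real system you would normally keep the supervisor in its own module.

```erlang
%% demo_app.erl: hypothetical application callback module.
%% The matching resource file (ebin/demo_app.app) would look roughly like:
%%
%%   {application, demo_app,
%%    [{description, "Hypothetical component"},
%%     {vsn, "0.1.0"},
%%     {modules, [demo_app]},
%%     {registered, [demo_app_sup]},
%%     {applications, [kernel, stdlib]},
%%     {mod, {demo_app, []}}]}.
-module(demo_app).
-behaviour(application).
-behaviour(supervisor).

%% application callbacks
-export([start/2, stop/1]).
%% supervisor callback
-export([init/1]).

%% Called by the application controller when the application starts.
%% It must return the pid of the application's top-level supervisor.
start(_StartType, _StartArgs) ->
    supervisor:start_link({local, demo_app_sup}, ?MODULE, []).

%% Called after the application has been stopped; nothing to clean up here.
stop(_State) ->
    ok.

%% An empty top-level supervisor; real child specs would be listed here.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    {ok, {SupFlags, []}}.
```

With the resource file on the code path, application:start(demo_app) starts the whole component and application:stop(demo_app) takes it down again, which is what lets you treat each application as a unit.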

How many supervisors should I have?

You should have one supervisor for each kind of process you want to monitor.

A bunch of identical temporary workers? One supervisor to rule them all.

Different processes with different responsibilities and restart strategies? A supervisor for each type of process, in a correct hierarchy (depending on when things should restart and which other processes need to go down with them).

Sometimes it is okay to put a bunch of different process types under the same supervisor. This is usually the case when you have a few singleton processes (e.g. one HTTP server supervisor, one ETS table owner process, one statistics collector) that will always run. In that case, it might be too much cruft to have one supervisor for each, so it is common to add them under one supervisor. Just be aware of the implications of the restart strategy you choose, so that you don't, for example, take down your statistics process when your web server crashes (one_for_one is the most common strategy in cases like this).

Be careful not to have any dependencies between processes in a one_for_one supervisor. If a process depends on another process that has crashed, it can crash as well; the combined restarts can exceed the supervisor's restart intensity and crash the supervisor itself too soon. This can be avoided by having two different supervisors, each of which controls restarts with its own configured intensity and period.
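As a rough illustration of the singleton case, here is a sketch of a one_for_one supervisor with two independent children. The module name singleton_sup and the child names stats_collector and table_owner are made up, and the children are trivial stand-in loops rather than real servers.

```erlang
-module(singleton_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1, start_loop/1, init_loop/2]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: a crashed child is restarted alone; its siblings keep running.
    %% More than 3 restarts within 10 seconds makes the supervisor itself give up.
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
    Children = [child(stats_collector), child(table_owner)],
    {ok, {SupFlags, Children}}.

%% Child spec for a permanent singleton worker registered under Name.
child(Name) ->
    #{id => Name,
      start => {?MODULE, start_loop, [Name]},
      restart => permanent,
      shutdown => 5000,
      type => worker}.

%% Start function used by the supervisor. proc_lib:start_link/3 returns
%% only after the child has called proc_lib:init_ack/2.
start_loop(Name) ->
    proc_lib:start_link(?MODULE, init_loop, [Name, self()]).

%% Stand-in for a real singleton: register the name, acknowledge, then loop.
init_loop(Name, Parent) ->
    register(Name, self()),
    proc_lib:init_ack(Parent, {ok, self()}),
    loop().

loop() ->
    receive
        crash  -> exit(crashed);   % simulate a failure
        _Other -> loop()
    end.
```

Sending stats_collector ! crash restarts only that child; table_owner keeps its pid. If the two processes depended on each other, this is exactly the setup where the restart intensity could be exhausted quickly.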

How do I decide which parts of the system should be process-based?

Every concurrent activity in your system should be in its own process. Having the wrong abstraction of concurrency is the most common mistake Erlang system designers make in the beginning.

Some people are not used to dealing with concurrency; their systems tend to have too little of it: one process, or a few gigantic ones, that run everything in sequence. These systems are usually full of code smells, and the code is very rigid and hard to refactor. It also makes them slower, because they may not use all the cores available to Erlang.

Other people immediately grasp the concurrency concepts but fail to apply them optimally; their systems tend to overuse the process concept, making many processes stay idle waiting for others that are doing work. These systems tend to be unnecessarily complex and hard to debug.

In essence, both variants give you the same problem: you don't use all the concurrency available to you, and you don't get the maximum performance out of the system.

If you stick to the single responsibility principle and abide by the rule to have a process for every truly concurrent activity in your system, you should be okay.

There are valid reasons to have idle processes. Sometimes they keep important state, sometimes you want to keep some data temporarily and later discard the process, sometimes they wait on external events. The bigger pitfall is to pass important messages through a long chain of largely inactive processes, as it will slow down your system with lots of copying and use more memory.

How should I avoid bottlenecks?

Hard to say, depends very much on your system and what it's doing. Generally though, if you have a good division of responsibility between applications you should be able to scale the application that appears to be the bottleneck separately from the rest of the system.

The golden rule here is to measure, measure, measure! Don't think you have something to improve until you've measured.

Erlang is great in that it allows you to hide concurrency behind interfaces (known as implicit concurrency). For example, you use a functional module API, a normal module:function(Arguments) interface, that could in turn spawn thousands of processes without the caller having to know that. If you got your abstractions and your API right, you can always parallelize or optimize a library after you've started using it.
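A minimal sketch of what hiding concurrency behind a functional API can look like: a hypothetical pmap module whose map/2 behaves like lists:map/2 but runs each element in its own process. The module name is an assumption, and the sketch does no error handling beyond linking.

```erlang
-module(pmap).
-export([map/2]).

%% Same contract as lists:map/2, but each element is processed in its own
%% process. The caller only sees an ordinary function call.
%% Workers are linked: if Fun crashes for some element, the caller
%% crashes too (unless it traps exits).
map(Fun, List) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(X)} end),
                Ref
            end || X <- List],
    %% Collect replies in the original order by matching on each unique ref.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

A caller that started out with lists:map(F, L) can switch to pmap:map(F, L) without any other change, which is the point: the API stays the same while the implementation becomes parallel.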

That being said, here are some general guidelines:

  • Try to send messages to the recipient directly; avoid channeling or routing messages through intermediary processes. Otherwise the system just spends time moving messages (data) around without really working.
  • Don't overuse the OTP design patterns, such as gen_servers. In many cases, you only need to start a process, run some piece of code, and then exit. For this, a gen_server is overkill.
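For the "start a process, run some code, exit" case, something like the following sketch is often enough. The module name oneshot and the {done, Result} exit convention are assumptions made for illustration, not an OTP idiom.

```erlang
-module(oneshot).
-export([run/2]).

%% Run Fun in a short-lived process and wait up to Timeout milliseconds.
%% The worker hands its result back through its exit reason, so no
%% explicit reply message or gen_server is needed.
run(Fun, Timeout) when is_function(Fun, 0) ->
    {Pid, MRef} = spawn_monitor(fun() -> exit({done, Fun()}) end),
    receive
        {'DOWN', MRef, process, Pid, {done, Result}} ->
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% Fun itself exited or crashed.
            {error, Reason}
    after Timeout ->
        %% Give up: drop the monitor (and any late 'DOWN'), kill the worker.
        erlang:demonitor(MRef, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```

For example, oneshot:run(fun() -> 2 + 2 end, 1000) returns {ok, 4}. The process lives only as long as the work does, so there is no state to keep and nothing to reuse.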

And one bonus piece of advice: don't reuse processes. Spawning a process in Erlang is so cheap and quick that it doesn't make sense to reuse a process once its lifetime is over. In some cases it might make sense to reuse state (e.g. the result of a complex file parse), but that is better stored canonically somewhere else (in an ETS table, a database, etc.).

Should I add logging later?

You should add logging now! There's a great built-in API called Logger that comes with Erlang/OTP from version 21:

    logger:error("The file does not exist: ~ts",[Filename]),
    logger:notice("Something strange happened!"),
    logger:debug(#{got => connection_request, id => Id, state => State},
                 #{report_cb => fun(R) -> {"~p",[R]} end}),

This new API has several advanced features and should cover most cases where you need logging. There's also the older but still widely used third-party library Lager.

What is the general approach to the architecture of distributed, fault-tolerant, multiprocessor Erlang/OTP systems?

To summarize what's been said above:

  • Divide your system into applications
  • Put your processes in the correct supervision hierarchy, depending on their needs and dependencies
  • Have a process for every truly concurrent activity in your system
  • Maintain a functional API towards the other components in the system. This lets you:
    • Refactor your code without changing the code that's using it
    • Optimize code afterwards
    • Distribute your system when needed (just make a call to another node behind the API! The caller won't notice!)
    • Test the code more easily (less work setting up test harnesses, easier to understand how to use it)
  • Start using the libraries available to you in OTP until you need something different (you'll know, when the time comes)

Common pitfalls:

  • Too many processes
  • Too few processes
  • Too much routing (forwarded messages, chained processes)
  • Too few applications (I've never seen the opposite case, actually)
  • Not enough abstraction (makes it hard to refactor and reason about. It also makes it hard to test!)
Adam Lindberg answered Oct 10 '22 00:10