
Global Dynamic Supervisor in a cluster

I have a unique issue that I have not had to address in Elixir before.

I need to use a DynamicSupervisor to start (n) children dynamically in a clustered environment. I am using libcluster to manage the clustering and the :global process registry to look up the dynamic supervisor's pid. Here is what is happening:

global: Name conflict terminating {:packer_supervisor, #PID<31555.1430.0>}

Here is the code for the supervisor:

defmodule EcompackingCore.PackerSupervisor do
  use DynamicSupervisor
  require Logger

  def start_link() do
    DynamicSupervisor.start_link(__MODULE__, :ok, name: {:global, :packer_supervisor})
  end

  def init(:ok) do
    Logger.info("Starting Packer Supervisor")
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def add_packer(badge_id, packer_name) do
    child_spec = {EcompackingCore.Packer, {badge_id, packer_name}}
    DynamicSupervisor.start_child(:global.whereis_name(:packer_supervisor), child_spec)
  end

  def remove_packer(packer_pid) do
    DynamicSupervisor.terminate_child(:global.whereis_name(:packer_supervisor), packer_pid)
  end

  def children do
    DynamicSupervisor.which_children(:global.whereis_name(:packer_supervisor))
  end

  def count_children do
    DynamicSupervisor.count_children(:global.whereis_name(:packer_supervisor))
  end

end

The issue seems to be that the supervisor is started on both nodes. What would be the best way to handle this? I really need the supervisor to be dynamic so I can manage the worker modules effectively. Possibly a different registry?

Thanks for your help.

asked Oct 01 '18 by Botonomous


3 Answers

If you want a rather simple solution that works with the global process registry, you can change your dynamic supervisor's start_link:

defmodule EcompackingCore.PackerSupervisor do
  use DynamicSupervisor
  require Logger

  def start_link() do
    case DynamicSupervisor.start_link(__MODULE__, :ok, name: {:global, :packer_supervisor}) do
      {:ok, pid} ->
        {:ok, pid}
      {:error, {:already_started, pid}} ->
        # Return the existing pid, so the supervisor above this one on each
        # node monitors the same globally registered process and every node
        # tracks its existence.
        {:ok, pid}
      any -> any
    end
  end

  def init(:ok) do
    Logger.info("Starting Packer Supervisor")
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  def add_packer(badge_id, packer_name) do
    child_spec = {EcompackingCore.Packer, {badge_id, packer_name}}
    DynamicSupervisor.start_child(:global.whereis_name(:packer_supervisor), child_spec)
  end

  def remove_packer(packer_pid) do
    DynamicSupervisor.terminate_child(:global.whereis_name(:packer_supervisor), packer_pid)
  end

  def children do
    DynamicSupervisor.which_children(:global.whereis_name(:packer_supervisor))
  end

  def count_children do
    DynamicSupervisor.count_children(:global.whereis_name(:packer_supervisor))
  end
end

Many people will say you should not do this, because in case of a network split you end up with two or more "global" processes in the cluster. But you can deal even with that if you implement some node monitoring/tracking, so you know how many nodes you "see" in the cluster.

For instance, if the cluster size is 5, you can add a check rule that verifies whether you see 3+ nodes. If not, schedule the next start attempt in, say, 1 second and keep trying to globally register your dynamic supervisor until the check passes (meaning you are in the majority group and can offer consistency across that group). On the other hand, if your node is in the minority group and already holds the global dynamic supervisor, shut it down and schedule a start attempt in 1 second.
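
As a rough illustration of that check rule, here is a minimal sketch (the module name, the fixed cluster size of 5, and the 1-second retry interval are all assumptions for this example):

defmodule EcompackingCore.QuorumStarter do
  @moduledoc "Sketch: only keep the global supervisor on a node that sees a majority."
  use GenServer

  @cluster_size 5   # assumed fixed cluster size
  @retry_ms 1_000

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok) do
    send(self(), :check_quorum)
    {:ok, %{}}
  end

  def handle_info(:check_quorum, state) do
    # This node plus every node it is currently connected to.
    visible = length(Node.list()) + 1

    if visible > div(@cluster_size, 2) do
      # Majority partition: try to (re)register the global supervisor.
      EcompackingCore.PackerSupervisor.start_link()
    else
      # Minority partition: shut the supervisor down if this node is holding it.
      case :global.whereis_name(:packer_supervisor) do
        pid when is_pid(pid) and node(pid) == node() -> DynamicSupervisor.stop(pid)
        _ -> :ok
      end
    end

    Process.send_after(self(), :check_quorum, @retry_ms)
    {:noreply, state}
  end
end

In a real system you would start the supervisor under your application's supervision tree rather than linking it to this process; the sketch only illustrates the majority check and the 1-second retry loop.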

That is the simplest way to achieve consistency across the cluster, but there is one thing you should consider: this DynamicSupervisor will start all of its workers on a single node, which I'm sure you don't want. So rather use a global registry plus some load-balancing algorithm to spread the processes across local supervisors; see the libraries below and the placement sketch after the list.

  • Swarm has built-in ring and static-quorum-ring algorithms, and uses hashing to distribute load across the cluster. It is a good solution if your workers have an ID you can hash.
  • Syn is another alternative.
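
As a rough illustration of such balancing (not Swarm's actual ring), here is a minimal sketch that hashes the worker ID onto the currently connected nodes. It assumes each node also runs a locally registered EcompackingCore.PackerSupervisor whose add_packer/2 starts the child on that node, which differs from the globally registered version above:

defmodule EcompackingCore.PackerPlacement do
  @doc "Start a packer on the node chosen by hashing its badge_id."
  def start_packer(badge_id, packer_name) do
    nodes = Enum.sort([node() | Node.list()])
    target = Enum.at(nodes, :erlang.phash2(badge_id, length(nodes)))

    # Assumes add_packer/2 starts the child on the supervisor local to `target`.
    :rpc.call(target, EcompackingCore.PackerSupervisor, :add_packer, [badge_id, packer_name])
  end
end

Note that plain phash2 reshuffles most assignments when cluster membership changes; a consistent hash ring (what Swarm uses) moves far fewer processes.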
answered Oct 07 '22 by Milan Jaric


After a bit of research I found a solution:

I am now using https://github.com/bitwalker/swarm to handle the pid registration. It allows setting up processes across a cluster and offers hand-off support if one of the nodes goes down.
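
For reference, here is a rough sketch of how the registration and hand-off could be wired up with Swarm; the {:packer, badge_id} name, the :packers group, and the module layout are assumptions for this example, not the code actually used:

defmodule EcompackingCore.PackerRegistry do
  @moduledoc "Sketch: register packers through Swarm instead of :global."

  # Swarm picks a node and starts the packer there via this MFA, which must
  # return {:ok, pid}.
  def start_packer(badge_id, packer_name) do
    with {:ok, pid} <-
           Swarm.register_name(
             {:packer, badge_id},
             EcompackingCore.PackerSupervisor,
             :add_packer,
             [badge_id, packer_name]
           ) do
      Swarm.join(:packers, pid)
      {:ok, pid}
    end
  end
end

# The worker itself takes part in Swarm's handoff protocol, roughly like this:
defmodule EcompackingCore.Packer do
  use GenServer

  # ... existing init/handle_* callbacks ...

  def handle_call({:swarm, :begin_handoff}, _from, state) do
    # Hand the current state to the node taking over.
    {:reply, {:resume, state}, state}
  end

  def handle_cast({:swarm, :end_handoff, handed_off_state}, _state) do
    {:noreply, handed_off_state}
  end

  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end
end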

answered Oct 07 '22 by Botonomous


You can use a simple central node to monitor the other nodes and run the one supervisor.

This central node only starts and monitors processes, and uses a database to save the other nodes' statuses and pids.

When a node joins or goes down, you receive the corresponding message and handle it (updating the database).

The only drawback of this method is that you can have just one central node, but since it does only simple things it is quite stable; it has been running on our production system for a year.
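
A minimal sketch of such a central monitor, using :net_kernel.monitor_nodes/1 (the module name is made up, and the database writes are left as placeholders):

defmodule CentralMonitor do
  @moduledoc "Sketch: a single central process that tracks node up/down events."
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok) do
    # Receive {:nodeup, node} / {:nodedown, node} messages for the whole cluster.
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, %{}}
  end

  def handle_info({:nodeup, node}, state) do
    # Placeholder: persist the node's status and pids to your database here.
    {:noreply, Map.put(state, node, :up)}
  end

  def handle_info({:nodedown, node}, state) do
    # Placeholder: mark the node down and reschedule its workers elsewhere.
    {:noreply, Map.put(state, node, :down)}
  end
end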

answered Oct 07 '22 by chris