Erlang: Distributed Application Strange behaviour

Question

I'm paying with distributed erlang applications.

Configuration and ideas are taken from:
http:/www.erlang.org/doc/pdf/otp-system-documentation.pdf 9.9. Distributed Applications

We have 3 nodes: n1@a2-X201, n2@a2-X201, n3@a2-X201
We have application wd that do some useful job :)

Configuration files:

wd1.config - for the first node:

      [{kernel,
          [{distributed,[{wd,5000,['n1@a2-X201',{'n2@a2-X201','n3@a2-X201'}]}]},
           {sync_nodes_mandatory,['n2@a2-X201','n3@a2-X201']},
           {sync_nodes_timeout,5000}
        ]}
      ,{sasl, [
      %% All reports go to this file
      {sasl_error_logger,{file,"/tmp/wd_n1.log"}}
      ]
    }].

wd2.config for the second:

    [{kernel,
        [{distributed,[{wd,5000,['n1@a2-X201',{'n2@a2-X201','n3@a2-X201'}]}]},
         {sync_nodes_mandatory,['n1@a2-X201','n3@a2-X201']},
         {sync_nodes_timeout,5000}
         ]
     }
    ,{sasl, [
        %% All reports go to this file
        {sasl_error_logger,{file,"/tmp/wd_n2.log"}}
    ]
    }].

For the node n3 looks similar.

Now start erlang in 3 separate terminals:

erl -sname n1@a2-X201 -config wd1 -pa $WD_EBIN_PATH -boot start_sasl
erl -sname n2@a2-X201 -config wd2 -pa $WD_EBIN_PATH -boot start_sasl
erl -sname n3@a2-X201 -config wd3 -pa $WD_EBIN_PATH -boot start_sasl

Start application on each of erlang nodes: * application:start(wd).

(n1@a2-X201)1> application:start(wd).

=INFO REPORT==== 19-Jun-2011::15:42:51 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\.SIG$" 
ok

(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2>

(n3@a2-X201)1> application:start(wd).
ok
(n3@a2-X201)2>

At the moment everything is Ok. As written in Erlang documentation: Application is running at node n1@a2-X201

Now kill node n1: Application was migrated to n2

(n2@a2-X201)2> 
=INFO REPORT==== 19-Jun-2011::15:46:28 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\.SIG$"

Continue our game: kill node n2 One more time system works fine. We have our application at node n3

(n3@a2-X201)2> 
=INFO REPORT==== 19-Jun-2011::15:48:18 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\.SIG$"

Now restore nodes n1 and n2. So:

Erlang R14B (erts-5.8.1) [source] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.1  (abort with ^G)
(n1@a2-X201)1> 

Eshell V5.8.1  (abort with ^G)
(n2@a2-X201)1>

Nodes n1 and n2 are back.
Looks like now I have to restart application manually: * Let's do it at node n2 first:

(n2@a2-X201)1> application:start(wd).

Looks like it hanged ...
Now restart it at n1

(n1@a2-X201)1> application:start(wd).

=INFO REPORT==== 19-Jun-2011::15:55:43 ===
wd_plug_server starting... PluginId: 4 Path: "/home/a2/src/erl/data/SIG" FileMask: "(?i)(.*)\.SIG$" 

ok
(n1@a2-X201)2>

It works. And node n2 also has returned OK:

Eshell V5.8.1  (abort with ^G)
(n2@a2-X201)1> application:start(wd).
ok
(n2@a2-X201)2>

At node n3 we see:

=INFO REPORT==== 19-Jun-2011::15:55:43 ===
    application: wd
    exited: stopped
    type: temporary

In general, everything looks ok, as written in documentation, except for delay with starting application at node n2.

Now kill node n1 once more:

(n1@a2-X201)2> 
User switch command
 --> q
[a2@a2-X201 releases]$

Ops ... everything hangs. Application was not restarted at another node.

Actually, while I was writing this post I've realized that sometime everything id Ok, sometime I have a problem.

Any ideas, While there could be problems when restoring "primary" node nd killing it one more time?

legoscia · Accepted Answer

As explained over at Learn You Some Erlang (scroll to the bottom), distributed applications only work well when started as part of a release, not when you start them manually with application:start.

Erlang: Distributed Application Strange behaviour

Tags:

erlang

erlang-otp

Anton Prokofiev

1 Answers

legoscia

Recent Activity

Donate For Us

Erlang: Distributed Application Strange behaviour

Tags:

erlang

erlang-otp

Anton Prokofiev

1 Answers

legoscia

Related questions

Recent Activity

Donate For Us