
Erlang: Cannot start slave - {error,timeout}

I'm currently trying to set up a distributed Tsung load testing environment which uses the Erlang slave functionality, however I have been unsuccessful in getting the controller node to start a slave node. E.g.

(musicglue@load1)1> net:ping(musicglue@load2).
pong
(musicglue@load1)2> slave:start(load2,musicglue,"-setcookie tom").
{error,timeout}

BACKGROUND

My env:

  • Controller - hostname: load1, user: musicglue, Ubuntu 10.04 LTS, Erlang R15B01 compiled from source
  • Slave - hostname: load2, user: musicglue, Ubuntu 10.04 LTS, Erlang R15B01 compiled from source
  • Firewall disabled
  • SELinux not installed

Things that are working:

  • I can SSH from load1 onto load2 and vice versa
  • I can start an erl session on both load1 and load2
  • I can start an erl session on load2 from load1; ssh load2 erl
  • I can successfully ping load2 from load1 from an erl session using the same cookie on both nodes.

Ping output:

musicglue@load1:~$ erl -rsh ssh -sname musicglue -setcookie tom
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.9.1  (abort with ^G)
(musicglue@load1)1> net:ping(musicglue@load2).
pong

THE ISSUE

My problem occurs when attempting to start a slave session from load1 on load2:

musicglue@load1:~$ erl -rsh ssh -sname musicglue -setcookie tom
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(musicglue@load1)1> net:ping(musicglue@load2).
pong
(musicglue@load1)2> slave:start(load2,musicglue,"-setcookie tom").
{error,timeout}

Here is the output I get from epmd when I run the slave:start command:

epmd: Thu May 24 10:01:57 2012: Non-local peer connected
epmd: Thu May 24 10:01:57 2012: opening connection on file descriptor 4
epmd: Thu May 24 10:01:57 2012: got 12 bytes
***** 00000000  00 0a 7a 6d 75 73 69 63 67 6c 75 65               |..zmusicglue|
epmd: Thu May 24 10:01:57 2012: ** got PORT2_REQ
epmd: Thu May 24 10:01:57 2012: got 2 bytes
***** 00000000  77 01                                             |w.|
epmd: Thu May 24 10:01:57 2012: ** sent PORT2_RESP (error) for "musicglue"
epmd: Thu May 24 10:01:57 2012: closing connection on file descriptor 4
epmd: Thu May 24 10:01:57 2012: Local peer connected
epmd: Thu May 24 10:01:57 2012: opening connection on file descriptor 4
epmd: Thu May 24 10:01:57 2012: got 24 bytes
***** 00000000  00 16 78 ca d6 4d 00 00  05 00 05 00 09 6d 75 73  |..x..M.......mus|
***** 00000010  69 63 67 6c 75 65 00 00                           |icglue..|
epmd: Thu May 24 10:01:57 2012: ** got ALIVE2_REQ
epmd: Thu May 24 10:01:57 2012: registering 'musicglue:1', port 51926
epmd: Thu May 24 10:01:57 2012: type 77 proto 0 highvsn 5 lowvsn 5
epmd: Thu May 24 10:01:57 2012: got 4 bytes
***** 00000000  79 00 00 01                                       |y...|
epmd: Thu May 24 10:01:57 2012: ** sent ALIVE2_RESP for "musicglue"
epmd: Thu May 24 10:01:57 2012: unregistering 'musicglue:1', port 51926
epmd: Thu May 24 10:01:57 2012: closing connection on file descriptor 4
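
For anyone reading the hex dumps in this log, the packets can be decoded by hand. The following is only a rough sketch: the helper names epmd_tag_name and epmd_port are hypothetical, and the tag-to-label mapping is taken from the labels epmd prints in this very log, not from any tool's API.

```shell
#!/bin/sh
# Hypothetical helpers for reading the epmd debug dump by hand.

# Map a message tag byte (hex) to the label epmd prints next to it in the log.
# In the dumps, the tag is the third byte of a request (after the two length
# bytes) and the first byte of a response.
epmd_tag_name() {
    case "$1" in
        77) echo "PORT2_RESP" ;;
        78) echo "ALIVE2_REQ" ;;
        79) echo "ALIVE2_RESP" ;;
        7a) echo "PORT2_REQ" ;;
        *)  echo "unknown" ;;
    esac
}

# Combine two hex bytes (most significant first) into a decimal port number.
epmd_port() {
    echo $(( (0x$1 << 8) | 0x$2 ))
}
```

For example, epmd_tag_name 7a prints PORT2_REQ, and epmd_port ca d6 (the two bytes following the 78 tag in the 24-byte packet) prints 51926, which matches the port in the "registering 'musicglue:1'" line.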

Any help or suggestions anyone has would be much appreciated,

Many thanks

EDIT

I should also mention that I can see the ssh connection being successfully acknowledged by load2 but then immediately disconnecting:

May 30 13:49:27 load2 sshd[16169]: Accepted publickey for musicglue from 173.45.236.182 port 51843 ssh2
May 30 13:49:27 load2 sshd[16171]: Received disconnect from 173.45.236.182: 11: disconnected by user

In response to the comments below, I have also tried starting the slave with a different node name for the slave:

musicglue@load1:~$ erl -rsh ssh -sname musicglue -setcookie tom
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(musicglue@load1)1> slave:start(load2,bar,"-setcookie tom").
{error,timeout}

and for the controller:

musicglue@load1:~$ erl -rsh ssh -sname foo -setcookie tom
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(foo@load1)1> slave:start(load2,musicglue,"-setcookie tom").
{error,timeout}

and for both:

musicglue@load1:~$ erl -rsh ssh -sname foo -setcookie tom
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(foo@load1)1> slave:start(load2,bar,"-setcookie tom").
{error,timeout}

But to no avail

SOLUTION

Turns out that my problem was that my slave was unable to SSH onto the controller and therefore could not respond to any commands.

After fixing this line of communication between the two nodes, everything worked perfectly.
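
Taking that diagnosis at face value, a quick way to catch it is to test non-interactive SSH in both directions before calling slave:start. The sketch below assumes key-based login is set up for the same user on both machines; the function name check_ssh_both_ways and the SSH_CMD override variable are hypothetical, introduced only for illustration. BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/bin/sh
# Sketch: check that non-interactive SSH works from this machine to a remote
# host and from that host back again. SSH_CMD is a hypothetical override hook
# so the function can be exercised without a network; it defaults to ssh.

check_ssh_both_ways() {
    local_host=$1    # name the remote host should use to reach this machine
    remote_host=$2   # host the slave node would be started on
    ssh_cmd=${SSH_CMD:-ssh}
    # BatchMode makes ssh fail instead of prompting for a password.
    opts="-o BatchMode=yes -o ConnectTimeout=5"

    # Forward direction: this machine -> remote host.
    if ! $ssh_cmd $opts "$remote_host" true; then
        echo "forward: cannot ssh to $remote_host"
        return 1
    fi

    # Reverse direction: remote host -> this machine, run on the remote side.
    if ! $ssh_cmd $opts "$remote_host" "ssh $opts $local_host true"; then
        echo "reverse: $remote_host cannot ssh to $local_host"
        return 2
    fi

    echo "ok: ssh works in both directions"
}
```

With the hostnames from this question, it would be run on the controller as check_ssh_both_ways load1 load2. A "reverse" failure would correspond to the situation described above.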

Tom Maguire asked May 29 '12 09:05


2 Answers

An alternate answer for those who find this question via Google: if you're trying to start a slave node on a separate machine, the host name in your controller's node name must be resolvable.

For example, I was having timeouts with:

> node().
[email protected]
> slave:start('192.168.122.196',bar,"-setcookie cookie").
{error,timeout}

I fixed this by starting my Erlang instance with an explicit domain name:

erl -name [email protected] -setcookie cookie
> slave:start('192.168.122.196',bar,"-setcookie cookie").

This command now succeeds.
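
One way to check this up front is to pull the host part out of the controller's node name and try to resolve it on the machine where the slave will run. This is a minimal sketch, not part of Erlang or Tsung: node_host and node_host_resolves are hypothetical helper names, and the lookup assumes getent is available (as on glibc-based Linux such as Ubuntu).

```shell
#!/bin/sh
# Sketch: extract the host part of an Erlang node name and check that it
# resolves. Both function names are hypothetical.

# Print everything after the first '@' of a node name such as foo@load1.
node_host() {
    echo "${1#*@}"
}

# Succeed if the host part of the node name resolves on this machine.
# Assumes getent exists; intended for host names rather than raw IP addresses.
node_host_resolves() {
    getent hosts "$(node_host "$1")" >/dev/null
}
```

For example, running node_host_resolves foo@load1 || echo "load1 does not resolve here" on the remote machine would flag a controller name that the slave cannot look up.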

Thomas M. DuBuisson answered Nov 20 '22 11:11


Try logging what goes on through SSH by creating a shell script like this somewhere in your PATH:

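#!/bin/sh

# Record how Erlang invoked this wrapper (overwrites any previous log).
echo "$0" "$@" > /tmp/my-ssh.log
# Run the real ssh verbosely; append its output and diagnostics to the log.
ssh -v "$@"  2>&1 | tee -a /tmp/my-ssh.log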

Call it my-ssh, make it executable, start Erlang with erl -rsh my-ssh, and check what goes into /tmp/my-ssh.log. That should shed some light on the problem.

legoscia answered Nov 20 '22 11:11