I've got two servers on a LAN with fresh installs of Centos 6.4 minimal and R 3.0.1. Both computers have doParallel, snow, and snowfall packages installed.
The servers can ssh to each other fine.
When I attempt to make clusters in either direction, I get a prompt for a password, but after entering the password, it just hangs there indefinately.
makePSOCKcluster("192.168.1.1",user="username")
How can I troubleshoot this?
edit:
I also tried calling makePSOCKcluster on the above-mentioned computer with a host that IS capable of being used as a slave (from other computers), but it still hangs. So, is it possible there is a firewall issue? I also tried using makePSOCKcluster with port 22:
> makePSOCKcluster("192.168.1.1",user="username",port=22)
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
cannot open the connection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
port 22 cannot be opened
here's my iptables
# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
You could start by setting the "outfile" option to an empty string when creating the cluster object:
makePSOCKcluster("192.168.1.1",user="username",outfile="")
This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:
makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)
This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.
If makePSOCKcluster
doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster
uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.
To test for networking or firewall problems, you could try connecting to the master process using "netcat". Execute makePSOCKcluster
in manual mode, specifying the hostname of the desired worker host and the port on local machine that should allow incoming connections:
> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:
node03$ nc node01 11234
The master process should immediately return with the message:
socket cluster with 1 nodes on host ‘node03’
while netcat should display no message, since it is quietly reading from the socket connection.
However, if netcat displays the message:
nc: getaddrinfo: Name or service not known
then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster
to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234)
.
If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with specified host at all. In either case, check netcat's return value to verify that it was an error:
node03$ echo $?
1
Hopefully that will give you enough information about the problem that you can get help from a network administrator.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With