 

makeCluster function in R snow hangs indefinitely

I am using the makeCluster function from the R package snow on a Linux machine to start a SOCK cluster on a remote Linux machine. Everything seems to be set up for the two machines to communicate successfully (I am able to establish ssh connections between the two). But:

makeCluster("192.168.128.24",type="SOCK")

does not return anything or throw an error; it just hangs indefinitely.

What am I doing wrong?

Thanks a lot

Aaron Iemma asked Jan 13 '23 18:01



1 Answer

Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any of the workers fail to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any error message is redirected to /dev/null.

In addition to ssh problems, makeSOCKcluster could hang because:

  • R not installed on a worker machine
  • snow not installed on a worker machine
  • R or snow not installed in the same location as on the local machine
  • current user doesn't exist on a worker machine
  • networking problem
  • firewall problem

and there are many more possibilities.
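Several of the items above can be checked directly on the worker machine. The following is only a rough sketch, not part of snow: `worker_preflight` is a hypothetical helper name, and it assumes you can open an R session on the worker (for example over ssh) as the same user that makeSOCKcluster would use.

```r
# Hypothetical helper (not part of snow): run it in an R session on the WORKER.
# It reports where Rscript lives, where R is installed, and whether the
# package can be loaded from the library paths this user sees.
worker_preflight <- function(pkg = "snow") {
  list(
    rscript_path  = unname(Sys.which("Rscript")),        # "" if Rscript is not on the PATH
    r_home        = R.home(),                            # R installation directory
    pkg_installed = requireNamespace(pkg, quietly = TRUE),
    lib_paths     = .libPaths(),                         # where R looks for packages
    user          = Sys.info()[["user"]]                 # account the session runs under
  )
}

str(worker_preflight())
```

Running the same function on the master and comparing `rscript_path`, `r_home` and `lib_paths` gives a quick hint about whether R or snow is installed in a different location on the two machines.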

In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.

In my experience, the single most useful troubleshooting technique is manual mode, which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:

cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")

makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem, and the hope is to get more information about why the workers aren't starting by trying to start them manually.

Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.

If makeSOCKcluster still just hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.
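One way to narrow down a networking or firewall issue is to test, from the worker machine, whether a TCP connection back to the master can be opened at all. The sketch below uses only base R; `can_connect` is a hypothetical helper name, and the host and port in the commented usage are placeholders. As far as I know snow's default master port is 10187, but treat that as an assumption and use whatever port you actually pass to makeSOCKcluster.

```r
# Hypothetical helper: returns TRUE if a TCP connection to host:port can be
# opened within `timeout` seconds, FALSE otherwise.
can_connect <- function(host, port, timeout = 5) {
  tryCatch({
    con <- socketConnection(host = host, port = port, open = "r+",
                            blocking = TRUE, timeout = timeout)
    close(con)  # we only care whether the connection could be opened
    TRUE
  },
  warning = function(w) FALSE,  # R warns when the host/port cannot be opened
  error   = function(e) FALSE)
}

# Usage on the worker (placeholders; 10187 is an assumed default port):
# can_connect("address.of.master", 10187)
```

The master only listens while makeSOCKcluster is waiting, so run the check during that window. Note that a successful probe uses up the connection the master was waiting for, so expect to restart makeSOCKcluster afterwards. A result of FALSE while the master is waiting suggests that something between the two machines is blocking the port.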

For more troubleshooting advice, see my answer to the question "making cluster in doParallel / snowfall hangs".

Steve Weston answered Jan 19 '23 01:01
