Multi-node cluster installation with H2O on AWS EC2

I was wondering how to set up an H2O cluster using multiple AWS EC2 instances and RStudio. I am not a computer scientist, so sorry for the trivial questions!

Based on this tutorial (http://amunategui.github.io/h2o-on-aws/), I successfully installed H2O and RStudio on an AWS EC2 instance (Linux). However, I would rather create a multi-instance cluster with, let's say, 4 instances with 8 cores each.

Following this document (http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/deployment/multinode.html), I need a flatfile.txt that lists the IP and port of each EC2 instance. Next, I have to copy this file to each node in the cluster, and then start the cluster from the Java command line. Since, as mentioned, I am not a computer scientist, some questions emerged:

  1. Where do I find the IPs and ports of each h2o instance?
  2. How exactly can I copy the resulting file to each node?
  3. From step 5 on I am completely confused: where do I have to enter this line, and where can I find the Java command line?
  4. I don't want to use the H2O Web UI, so how can I access the cluster from RStudio (installed on one of the instances)?

Thank you so much in advance!

asked Jul 13 '16 12:07 by constiii


1 Answer

1a. Where do you get the IPs? They are shown when you create each EC2 instance. It is the private IP you want (normally starting with 172.). (BTW, make sure you create all the instances in the same availability zone.)

1b. Use 54321 as the port. So your flatfile.txt for 3 nodes might look like:

172.31.1.123:54321
172.31.2.237:54321
172.31.3.99:54321
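
If you would rather not type the file by hand, a small script can generate it. This is only a sketch: the three IPs below are placeholders, and the variable names are made up for illustration. Substitute the private IPs shown for your own instances.

```shell
#!/bin/sh
# Sketch: build flatfile.txt with one "ip:port" line per H2O node.
set -eu

PORT=54321
# Placeholder private IPs (assumption): replace with your instances' private IPs.
PRIVATE_IPS="172.31.1.123 172.31.2.237 172.31.3.99"

# Write one entry per line, overwriting any existing flatfile.txt.
for ip in $PRIVATE_IPS; do
    printf '%s:%s\n' "$ip" "$PORT"
done > flatfile.txt

# Show the result so it can be checked by eye.
cat flatfile.txt
```

Every node must end up with the same file, so it is easiest to generate it once and copy that one file around.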

2. You might create flatfile.txt on your notebook, then scp it to the home directory on each node. (Use the public IP for scp.)
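
The copy step could be looped rather than typed once per node. The sketch below is hedged: the key file name, the login user (`ubuntu` here; it may be `ec2-user` or something else depending on your AMI) and the public IPs are all placeholders. By default it only prints the scp commands so you can review them first.

```shell
#!/bin/sh
# Sketch: copy flatfile.txt to the home directory of every node.
set -eu

# Placeholder values (assumptions): replace with your own.
KEY="${KEY:-my-key.pem}"
REMOTE_USER="${REMOTE_USER:-ubuntu}"
PUBLIC_IPS="${PUBLIC_IPS:-203.0.113.10 203.0.113.11 203.0.113.12}"
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

copy_flatfile() {
    for ip in $PUBLIC_IPS; do
        # "user@host:" with nothing after the colon means the remote home directory.
        if [ "$DRY_RUN" = "1" ]; then
            echo scp -i "$KEY" flatfile.txt "$REMOTE_USER@$ip:"
        else
            scp -i "$KEY" flatfile.txt "$REMOTE_USER@$ip:"
        fi
    done
}

copy_flatfile
```

Running it with `DRY_RUN=0` performs the real copies, assuming the key file is the one you chose when launching the instances.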

3. ssh in to each machine in turn, and then type that command from the home directory, e.g.:

 java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321
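
One practical wrinkle: a command typed in the foreground normally stops when you close the ssh session. A possible workaround, sketched below, is to start it with nohup in the background. The function names, the log file and the pid file are made up for illustration; as written the script only prints the command, and the line that actually starts H2O is left commented out.

```shell
#!/bin/sh
# Sketch: start H2O so that it keeps running after you log out of the node.
set -eu

# Heap size for the JVM; 20g is the value used in the command above.
H2O_HEAP="${H2O_HEAP:-20g}"

# Print the java command line (assumes h2o.jar and flatfile.txt are in the current directory).
h2o_command() {
    echo "java -Xmx${H2O_HEAP} -jar h2o.jar -flatfile flatfile.txt -port 54321"
}

# Run the command in the background, detached from the terminal.
start_h2o() {
    # Output goes to h2o.log; the process id is saved so the node can be stopped later.
    nohup $(h2o_command) > h2o.log 2>&1 &
    echo "$!" > h2o.pid
}

h2o_command
# Uncomment on a node to actually start H2O:
# start_h2o
```

Pick a heap size that fits within the memory of your instance type.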

4. First make sure port 8787 (the default RStudio Server port) is open in your Amazon firewall (aka "security group"). Once you've made sure the H2O cluster is running (and assuming you have installed the H2O R package, and made sure it is exactly the same version as on each node in your cluster), you simply do:

library(h2o)
h2o.init()

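By default, h2o.init() looks on the local machine for a running H2O node. Since RStudio is on one of the cluster machines, it finds that node, and through it you are using the whole cluster.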
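
Before connecting from R, it can be worth checking that the nodes really found each other. The sketch below assumes an H2O-3 node, whose REST API I believe reports a "cloud_size" field at /3/Cloud, and that curl is installed on the node. The helper name is made up, and the sample JSON is hand-written for illustration, not real output.

```shell
#!/bin/sh
# Sketch: read the cluster size out of the JSON a node returns.
set -eu

# Pull the number after "cloud_size" out of JSON text read from stdin.
cloud_size() {
    sed -n 's/.*"cloud_size"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hand-written, abbreviated sample of a response (not real output).
SAMPLE='{"cloud_name":"demo","cloud_size":3,"cloud_healthy":true}'
printf '%s\n' "$SAMPLE" | cloud_size

# Against a live node (run this on the node itself):
#   curl -s http://localhost:54321/3/Cloud | cloud_size
```

If the number printed is smaller than the number of lines in flatfile.txt, some nodes have not joined; the usual suspects are a security group that blocks traffic between the instances, or mismatched H2O versions.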


Aside:

What I have been using are the scripts found here:

https://github.com/h2oai/h2o-3/tree/master/ec2

They do almost all the steps for you, including making the flatfile, distributing it, and starting H2O on each node. You still need to set up a security group (well, optionally, I suppose: the script default is to have no security group!), and you need to set a password for the user you will use to log in to RStudio. And you need to install the H2O R package (I think that could be done from inside RStudio, if you have an aversion to the command line).

answered Nov 05 '22 06:11 by Darren Cook