I am wondering how to set up an H2O cluster using multiple AWS EC2 instances and RStudio. I am not a computer scientist, so sorry for the trivial questions (!)
Based on this tutorial (http://amunategui.github.io/h2o-on-aws/) I successfully installed H2O and RStudio on a single AWS EC2 instance (Linux). But I would rather create a multi-instance cluster with, let's say, 4 instances with 8 cores each.
According to this document (http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/deployment/multinode.html), I need a flatfile.txt that lists the IP and port of each EC2 instance. Next, I have to copy this file to each node in the cluster, and then start the cluster from the java command line... Since I am not a computer scientist, as I already mentioned, some questions emerged:
Thank you so much in advance!
1a. Where to get the IPs? AWS shows them to you as you create each EC2 instance. It is the private IP you want (normally starting with 172.) (BTW, make sure you create all the instances in the same availability zone.)
1b. Use 54321 as the port. So your flatfile.txt for 3 nodes might look like:
172.31.1.123:54321
172.31.2.237:54321
172.44.99.99:54321
2. You might make the flatfile.txt on your laptop, then scp it to the home directory on each node. (Use the public IP for scp.)
3. ssh into each machine in turn, and then run the java command from the home directory, e.g.
java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321
4. First make sure port 8787 (the RStudio Server port) is open in your Amazon firewall (aka "security group"). Once you've made sure the H2O cluster is running (and assuming you have installed the H2O R package, and made sure it is exactly the same version as the H2O running on each node in your cluster), you simply do:
library(h2o)
h2o.init()
By default, h2o.init() looks on the local machine for any node in the cluster and connects to it.
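Steps 1 to 3 can be scripted from your laptop. The following is only a rough sketch, not a tested deployment script: the IP addresses, the key file path, and the ec2-user login are placeholders I made up, so substitute your own. It writes flatfile.txt from the private IPs and then just prints the scp and ssh commands (using the public IPs) so you can check them before running anything.

```shell
#!/usr/bin/env bash
# Sketch only: every value below is a placeholder (assumption), not a real host.
PORT=54321
PRIVATE_IPS="172.31.1.123 172.31.2.237"   # private IPs go into the flatfile
PUBLIC_IPS="203.0.113.10 203.0.113.11"    # public IPs are used for scp/ssh
KEY="$HOME/.ssh/my-key.pem"               # hypothetical key file
REMOTE_USER="ec2-user"                    # assumed login name; depends on your AMI

# Emit one "ip:port" line per node, which is the flatfile format shown above.
make_flatfile() {
  for ip in $PRIVATE_IPS; do
    echo "${ip}:${PORT}"
  done
}

# Print (do not execute) the copy and start commands for each node.
print_commands() {
  for ip in $PUBLIC_IPS; do
    echo "scp -i $KEY flatfile.txt ${REMOTE_USER}@${ip}:"
    echo "ssh -i $KEY ${REMOTE_USER}@${ip} 'nohup java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port ${PORT} > h2o.log 2>&1 < /dev/null &'"
  done
}

make_flatfile > flatfile.txt
print_commands
```

Printing the commands instead of running them is deliberate: you can paste them one at a time and see which node fails. The nohup and the trailing & are my addition, so that H2O keeps running after the ssh session ends; if you start it interactively as in step 3, leave them out.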
Aside:
What I have been using are the scripts found here:
https://github.com/h2oai/h2o-3/tree/master/ec2
They do almost all the steps for you, including making the flatfile, distributing it, and starting H2O on each node. You still need to set up a security group (well, optionally, I suppose: the script default is to have no security group!), and you need to set a password for the user you will log in to RStudio with. And you need to install the H2O R package (I think that could be done from inside RStudio, if you have an aversion to the command line).