Most of the setup instructions I see are verbose. Is there a near script-like set of commands that we can just execute to set up Hadoop and Pydoop on an Ubuntu instance on Amazon EC2?
Another solution would be to use Juju (Ubuntu's service orchestration framework).
First, install the Juju client on your local machine:
sudo add-apt-repository ppa:juju/stable
sudo apt-get update && sudo apt-get install juju-core
(instructions for MacOS and Windows are also available here)
Then generate a configuration file
juju generate-config
Then edit it to add your preferred cloud credentials (AWS, Azure, GCE...). Since you mention m3.medium, I assume you are using AWS, so follow the instructions for AWS.
Note: The above has to be done only once.
Now bootstrap
juju bootstrap amazon
Optionally, deploy the GUI (like the demo available on the website):
juju deploy --to 0 juju-gui && juju expose juju-gui
You'll find the URL of the GUI and password with:
juju api-endpoints | cut -f1 -d":"
cat ~/.juju/environments/amazon.jenv | grep pass
Note that the steps above are prerequisites for any Juju deployment and can be reused every time you want to spin up the environment.
Now comes your use case with Hadoop. You have several options.
Just deploy 1 node of Hadoop
juju deploy --constraints "cpu-cores=2 mem=4G root-disk=20G" hadoop
You can track the deployment with
juju debug-log
and get info about the new instances with
juju status
The juju deploy command above is the only one you need to deploy Hadoop (you can think of Juju as an evolution of apt for complex systems).
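If you want a script to wait until the deployment has settled instead of watching the log, the output of juju status can be checked in a loop. The following is only a sketch: it assumes the Juju 1.x YAML output, in which each machine and unit has an "agent-state:" line, and all_started is a hypothetical helper name, not a Juju command.

```shell
#!/bin/sh
# Sketch: decide from `juju status` text whether every agent has started.
# Assumption: Juju 1.x YAML output with one "agent-state:" line per machine
# and per unit (values such as started, pending, error).
# all_started is a hypothetical helper, not part of Juju.

# Reads status text on stdin. Succeeds only if at least one agent-state line
# is present and every one of them says "started".
all_started() {
  awk '
    BEGIN { total = 0; bad = 0 }
    /agent-state:/ { total++; if ($2 != "started") bad++ }
    END { exit (total > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage (polls every 10 seconds until all agents report started):
#   until juju status | all_started; do sleep 10; done
```

An empty or unexpected status output counts as "not ready", so the loop keeps waiting rather than proceeding on bad input.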
Deploy a cluster of 3 nodes with HDFS and MapReduce
juju deploy hadoop hadoop-master
juju deploy hadoop hadoop-slavecluster
juju add-unit -n 2 hadoop-slavecluster
juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager
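Since the question asks for something script-like, the five commands above can be wrapped in a small function. This is a minimal sketch, not an official Juju script: deploy_cluster, run and DRY_RUN are hypothetical names, and by default it only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch: wrap the 3-node cluster deployment in one function.
# deploy_cluster, run and DRY_RUN are hypothetical names, not part of Juju.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# deploy_cluster EXTRA_SLAVES
# Deploys one master and one slave unit, then adds EXTRA_SLAVES more slaves.
deploy_cluster() {
  dc_extra="${1:-2}"
  run juju deploy hadoop hadoop-master
  run juju deploy hadoop hadoop-slavecluster
  run juju add-unit -n "$dc_extra" hadoop-slavecluster
  run juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
  run juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager
}

deploy_cluster 2
```

Run it once as is to check the printed commands, then again with DRY_RUN=0 in the environment to execute them against the bootstrapped environment.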
Scale out usage (separate HDFS & MapReduce, experimental)
juju deploy hadoop hdfs-namenode
juju deploy hadoop hdfs-datacluster
juju add-unit -n 2 hdfs-datacluster
juju add-relation hdfs-namenode:namenode hdfs-datacluster:datanode
juju deploy hadoop mapred-resourcemanager
juju deploy hadoop mapred-taskcluster
juju add-unit -n 2 mapred-taskcluster
juju add-relation mapred-resourcemanager:mapred-namenode hdfs-namenode:namenode
juju add-relation mapred-taskcluster:mapred-namenode hdfs-namenode:namenode
juju add-relation mapred-resourcemanager:resourcemanager mapred-taskcluster:nodemanager
For Pydoop, you'll have to install it manually as in the first answer (you can reach the Juju instances via "juju ssh "), or you can write a "charm" (a recipe that teaches Juju how to deploy Pydoop).
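The manual route can be looped over the units of a service. The sketch below makes several assumptions: units are numbered SERVICE/0 to SERVICE/N-1 (true only if no unit has been removed), and the install command is a placeholder, here a plain pip install, which you should replace with whatever steps the manual installation actually needs on your nodes. install_pydoop, PYDOOP_CMD and DRY_RUN are hypothetical names.

```shell
#!/bin/sh
# Sketch: print (or run) a Pydoop install on each unit of a Juju service.
# Assumptions: units are named SERVICE/0 .. SERVICE/N-1, and PYDOOP_CMD is a
# placeholder for the real manual install steps.
# install_pydoop, PYDOOP_CMD and DRY_RUN are hypothetical names.
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print only, 0 = execute
PYDOOP_CMD="${PYDOOP_CMD:-sudo pip install pydoop}"

# install_pydoop SERVICE UNIT_COUNT
install_pydoop() {
  ip_service="$1"
  ip_count="$2"
  ip_i=0
  while [ "$ip_i" -lt "$ip_count" ]; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "juju ssh $ip_service/$ip_i $PYDOOP_CMD"
    else
      juju ssh "$ip_service/$ip_i" "$PYDOOP_CMD"
    fi
    ip_i=$((ip_i + 1))
  done
}

install_pydoop hadoop-slavecluster 3
```

If you end up repeating this often, that is a good sign the steps belong in a charm instead.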