
How do I install Hadoop and Pydoop on a fresh Ubuntu instance?

Most of the setup instructions I see are verbose. Is there a near script-like set of commands that we can just execute to set up Hadoop and Pydoop on an Ubuntu instance on Amazon EC2?

S Anand asked Dec 20 '22 05:12


1 Answer

Another solution would be to use Juju (Ubuntu's service orchestration framework).

First, install the Juju client on your local machine:

sudo add-apt-repository ppa:juju/stable
sudo apt-get update && sudo apt-get install juju-core

(Instructions for macOS and Windows are also available.)

Then generate a configuration file:

juju generate-config

Then edit it to add the credentials for your preferred cloud (AWS, Azure, GCE...). Since you mention m3.medium, I assume you are using AWS, so follow the AWS instructions.

Note: The above has to be done only once.

Now bootstrap the environment:

juju bootstrap amazon

Optionally, deploy the GUI (like the demo available on the website):

juju deploy --to 0 juju-gui && juju expose juju-gui

You can find the GUI's URL and password with:

juju api-endpoints | cut -f1 -d":"
grep pass ~/.juju/environments/amazon.jenv

Note that the steps above are prerequisites for any Juju deployment and can be reused every time you want to spin up the environment.

Now comes your use case with Hadoop. You have several options.

  1. Deploy a single Hadoop node

    juju deploy --constraints "cpu-cores=2 mem=4G root-disk=20G" hadoop
    

You can track the deployment with

juju debug-log

and get info about the new instances with

juju status

The "juju deploy" command above is the only one you need to deploy Hadoop (you can think of Juju as an evolution of apt for complex systems).

  2. Deploy a cluster of 3 nodes with HDFS and MapReduce

    juju deploy hadoop hadoop-master
    juju deploy hadoop hadoop-slavecluster
    juju add-unit -n 2 hadoop-slavecluster
    juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
    juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager
    
  3. Scale out usage (separate HDFS & MapReduce, experimental)

    juju deploy hadoop hdfs-namenode
    juju deploy hadoop hdfs-datacluster
    juju add-unit -n 2 hdfs-datacluster
    juju add-relation hdfs-namenode:namenode hdfs-datacluster:datanode
    juju deploy hadoop mapred-resourcemanager
    juju deploy hadoop mapred-taskcluster
    juju add-unit -n 2 mapred-taskcluster
    juju add-relation mapred-resourcemanager:mapred-namenode hdfs-namenode:namenode
    juju add-relation mapred-taskcluster:mapred-namenode hdfs-namenode:namenode
    juju add-relation mapred-resourcemanager:resourcemanager mapred-taskcluster:nodemanager
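
Since the question asks for something script-like, here is a minimal sketch that wraps the cluster commands from option 2 in a POSIX shell function. The names run, deploy_cluster, DRY_RUN and SLAVES are hypothetical helpers introduced for illustration, not part of Juju. By default the script only prints the commands, so you can review them before running anything against your cloud account.

```shell
#!/bin/sh
# Sketch only: replays the master/slave cluster commands shown above.
# DRY_RUN and SLAVES are illustrative variables, not Juju settings.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them
SLAVES="${SLAVES:-3}"     # total number of units in hadoop-slavecluster

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

deploy_cluster() {
    run juju deploy hadoop hadoop-master
    run juju deploy hadoop hadoop-slavecluster
    # "juju deploy" already creates one unit, so only add the remainder.
    extra=$((SLAVES - 1))
    if [ "$extra" -gt 0 ]; then
        run juju add-unit -n "$extra" hadoop-slavecluster
    fi
    run juju add-relation hadoop-master:namenode hadoop-slavecluster:datanode
    run juju add-relation hadoop-master:resourcemanager hadoop-slavecluster:nodemanager
}

deploy_cluster
```

Saved under a name of your choice, it prints the five commands when run as is. Setting DRY_RUN=0 executes them, which assumes the environment has already been bootstrapped as described earlier.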
    

For Pydoop, you'll have to install it manually as in the first answer (you can access the Juju instances via "juju ssh "), or you can write a "charm" (a recipe that teaches Juju how to deploy Pydoop).

Samuel Cozannet answered Dec 21 '22 20:12