
Run a Hadoop cluster on Docker containers

I want to run a multi-node Hadoop cluster, with each node inside a Docker container on a different host. This image (https://github.com/sequenceiq/hadoop-docker) works well for starting Hadoop in pseudo-distributed mode. What is the easiest way to modify it so that each node runs in its own container on a separate EC2 host?

user1016313 asked Nov 19 '14 03:11




2 Answers

I did this with two containers, one running the master node and one running the slave node, on two different Ubuntu hosts. I set up the networking between the containers using Weave. I installed Hadoop the same way it is installed on separate hosts. Both images are on Docker Hub under the account div4, together with the commands to run Hadoop on them:

https://registry.hub.docker.com/u/div4/hadoop_master/
https://registry.hub.docker.com/u/div4/hadoop_slave/
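
For anyone building their own images, the main difference from the pseudo-distributed setup is usually that every node's core-site.xml points fs.defaultFS at the master, and the master's slaves file lists the worker hostnames. The following is only a minimal sketch of generating those two files, not the configuration used in the images above. The hostnames hadoop-master and hadoop-slave1, the port 9000, and both function names are assumptions made for illustration; the hostnames have to resolve across hosts, for example over the Weave network.

```python
from typing import List
from xml.etree import ElementTree as ET


def render_core_site(namenode_host: str, port: int = 9000) -> str:
    """Return core-site.xml text with fs.defaultFS pointing at the master.

    The host and port are placeholders (assumptions), not values taken
    from the images mentioned in the answer.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


def render_slaves_file(worker_hosts: List[str]) -> str:
    """Return the contents of the master's slaves file, one hostname per line."""
    return "\n".join(worker_hosts) + "\n"


if __name__ == "__main__":
    # Hypothetical hostnames for a one-master, one-slave cluster.
    print(render_core_site("hadoop-master"))
    print(render_slaves_file(["hadoop-slave1"]), end="")
```

The generated core-site.xml would go on both containers, and the slaves file only on the master, in Hadoop's configuration directory.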

div answered Sep 28 '22 12:09


The people at SequenceIQ have created a new project called Cloudbreak, which is designed to work with different cloud providers and makes it easy to create Hadoop clusters on them. You just enter your credentials, and as far as I can see it then works the same way for all providers.

So for EC2, this is now probably the easiest solution (especially because of its nice GUI):

https://github.com/sequenceiq/cloudbreak-deployer

SGer answered Sep 28 '22 11:09