 

Running jobs in parallel in Hadoop

Tags:

hadoop

I am new to Hadoop.

I have set up a 2-node cluster.

How do I run 2 jobs in parallel in Hadoop?

When I submit jobs, they run one by one in FIFO order. I need them to run in parallel. How do I achieve that?

Thanks, MRK

MRK asked Sep 20 '11

People also ask

What is used to run multiple jobs in parallel in Hadoop?

If you use HadoopActivity with either the Fair Scheduler or the Capacity Scheduler, you can run multiple steps in parallel.

How Hadoop runs a MapReduce job?

MapReduce assigns fragments of data across the nodes in a Hadoop cluster. The goal is to split a dataset into chunks and use an algorithm to process those chunks at the same time. The parallel processing on multiple machines greatly increases the speed of handling even petabytes of data.

What is a job in MapReduce?

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
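To make that flow concrete, here is a minimal word-count sketch against the classic Hadoop MapReduce Java API. It is the standard textbook example rather than anything specific to the question above; input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each mapper processes one input split (chunk); splits run in parallel.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);  // emit (word, 1) for each token
            }
        }
    }

    // The framework sorts map output by key before handing it to reducers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);  // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```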


3 Answers

Hadoop can be configured with a number of schedulers and the default is the FIFO scheduler.

The FIFO scheduler behaves like this:

Scenario 1: If the cluster has a capacity of 10 map tasks and job1 needs 15, job1 initially takes the complete cluster. As job1 makes progress and frees slots it no longer needs, job2 starts running on them.

Scenario 2: If the cluster has a capacity of 10 map tasks and job1 needs only 6, job1 takes 6 slots and job2 takes the remaining 4; job1 and job2 run in parallel.

To run jobs in parallel from the start, configure either the Fair Scheduler or the Capacity Scheduler, depending on your requirements. For this to take effect, set mapreduce.jobtracker.taskscheduler and the scheduler-specific parameters in mapred-site.xml, as sketched below.
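As a sketch, the mapred-site.xml entries for switching an MR1 JobTracker to the Fair Scheduler might look like the following; property names differ across Hadoop versions (older 1.x releases use mapred.jobtracker.taskScheduler, and the allocation-file path here is just an illustrative example), so check the documentation for your release:

```xml
<!-- mapred-site.xml: sketch for enabling the Fair Scheduler on an MR1
     JobTracker. Property names vary by Hadoop version; verify against
     your release's documentation. -->
<configuration>
  <!-- Swap the default FIFO scheduler for the Fair Scheduler.
       (On older 1.x releases the key is mapred.jobtracker.taskScheduler.) -->
  <property>
    <name>mapreduce.jobtracker.taskscheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

  <!-- Optional: per-pool allocations (shares, limits) live in a separate
       allocation file. This path is an illustrative example. -->
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
  </property>
</configuration>
```

For the Capacity Scheduler, the value would instead be org.apache.hadoop.mapred.CapacityTaskScheduler, which reads its queue definitions from a separate capacity-scheduler.xml.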

Edit: Updated the answer based on the comment from MRK.

Praveen Sripati answered Oct 14 '22


You have "Map Task Capacity" and "Reduce Task Capacity". Whenever those are free they would pick the job in FIFO order. Your submitted jobs contains mapper and optionally reducer. If your jobs mapper (and/or reducer) count is smaller then the cluster's capacity it would take the next jobs mapper (and/or reducer).

If you don't like FIFO, you can always assign priorities to your submitted jobs, as in the sketch below.
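For example, with the old (mapred) Java API you can set a priority when configuring the job; this is a sketch with the mapper/reducer wiring omitted, and PriorityDemo is a placeholder class name:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

// Sketch: submit a job at elevated priority with the classic mapred API.
public class PriorityDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PriorityDemo.class);
        conf.setJobName("high-priority-job");
        // Priorities: VERY_HIGH, HIGH, NORMAL (default), LOW, VERY_LOW.
        conf.setJobPriority(JobPriority.HIGH);
        // ... set mapper, reducer, and input/output paths here ...
        JobClient.runJob(conf);
    }
}
```

A queued job's priority can also be changed after submission with hadoop job -set-priority <job-id> HIGH. Note that under FIFO, priority only reorders the queue; it does not preempt a job that is already running.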

Edit:

Sorry about the slight misinformation; Praveen's answer is the right one. In addition to his answer, you can check out the HOD scheduler as well.

frail answered Oct 15 '22


With the default scheduler, only one job per user runs at a time. You can launch different jobs from different user IDs and they will run in parallel; of course, as mentioned by others, you need enough slot capacity.

kiru answered Oct 15 '22