
Running a standalone Hadoop application on multiple CPU cores

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) need a multiserver Hadoop cluster, but we chose to start this project "being prepared".

When I run this app on the command line (or in Eclipse or NetBeans) I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Given that the tool is very CPU intensive, this "single-threadedness" is my current bottleneck.

When running it in the NetBeans profiler I do see that the app starts several threads for various purposes, but only a single map/reduce task is running at any given moment.

The input data consists of several input files, so Hadoop should at least be able to run one thread per input file concurrently during the map phase.

What do I do to at least have 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?

I'm expecting this to be something very silly that I've overlooked.


I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. This implements the feature I was looking for in Hadoop 0.21: it introduces the flag mapreduce.local.map.tasks.maximum to control it.
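For reference, a minimal sketch of how that flag could be set on the job configuration, assuming Hadoop 0.21+ and the new mapreduce API; the job name, paths, and the commented-out mapper/reducer classes are placeholders, and the value 4 is just an example core count:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalParallelJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Only honoured by the LocalJobRunner from Hadoop 0.21 on (MAPREDUCE-1367):
        // allow up to 4 map tasks to run in parallel inside the local JVM.
        conf.setInt("mapreduce.local.map.tasks.maximum", 4);

        Job job = new Job(conf, "cpu-heavy transform");
        job.setJarByClass(LocalParallelJob.class);
        // job.setMapperClass(MyMapper.class);    // placeholder for the real mapper
        // job.setReducerClass(MyReducer.class);  // placeholder for the real reducer
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the driver goes through ToolRunner/GenericOptionsParser, the same property should also be settable on the command line with -D mapreduce.local.map.tasks.maximum=4.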

For now I've also found the solution described here in this question.

asked Aug 04 '10 by Niels Basjes

2 Answers

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.

Anyway, to set the maximum number of concurrently running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default both are set to 2, so I might be right.
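A sketch of how those two properties would typically be set, assuming they go into mapred-site.xml on each tasktracker node (they are daemon-side settings read at start-up, not job-side options); the slot counts are just example values:

```xml
<!-- mapred-site.xml on each tasktracker node -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- example: up to 4 concurrent map slots per tasktracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- example: up to 2 concurrent reduce slots per tasktracker -->
  </property>
</configuration>
```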

Finally, if you want to be prepared for a multinode cluster, go straight to running this in fully-distributed mode, but have all the daemons (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.

answered by wlk

Just for clarification: if Hadoop runs in local mode you don't get parallel execution at the task level (unless you're running Hadoop >= 0.21 (MAPREDUCE-1367)). You can, however, submit multiple jobs at once, and those are then executed in parallel.
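A rough sketch of that multi-job approach, assuming the new org.apache.hadoop.mapreduce API and that the configured Job objects are built elsewhere:

```java
import java.util.List;
import org.apache.hadoop.mapreduce.Job;

public class ParallelJobSubmitter {
    public static void runAll(List<Job> jobs) throws Exception {
        // submit() returns immediately, so all jobs start right away;
        // in local mode each one then runs inside the client JVM.
        for (Job job : jobs) {
            job.submit();
        }
        // Wait until every job has finished, failing fast on errors.
        for (Job job : jobs) {
            while (!job.isComplete()) {
                Thread.sleep(1000);
            }
            if (!job.isSuccessful()) {
                throw new RuntimeException("Job failed: " + job.getJobName());
            }
        }
    }
}
```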

All the

mapred.tasktracker.{map|reduce}.tasks.maximum

properties only apply to Hadoop running in distributed mode!

HTH, Johannes

answered by oae