Running a standalone Hadoop application on multiple CPU cores

Question

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) need a multiserver Hadoop cluster, but we chose to start this project "being prepared".

When I run this app on the command line (or in Eclipse or NetBeans), I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Since the tool is very CPU intensive, this "single threadedness" is my current bottleneck.

When running it in the NetBeans profiler, I do see that the app starts several threads for various purposes, but only a single map/reduce task is running at any given moment.

The input data consists of several input files, so Hadoop should at least be able to run one thread per input file at the same time during the map phase.

What do I do to have at least 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?

I'm expecting this to be something very silly that I've overlooked.

I just found https://issues.apache.org/jira/browse/MAPREDUCE-1367 which implements the feature I was looking for in Hadoop 0.21. It introduces the flag mapreduce.local.map.tasks.maximum to control it.

For now I've also found the solution described here in this question.
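A minimal sketch of how a value for mapreduce.local.map.tasks.maximum might be chosen and passed along. It is not taken from the original post: the class LocalMapSlots, the sizing rule (no more map threads than input files or CPU cores), and the idea of handing the setting over as a generic "-D key=value" argument are all assumptions, and the latter only helps if the driver parses generic options (for example via ToolRunner).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class LocalMapSlots {
    // Property name as quoted in the question (MAPREDUCE-1367, Hadoop 0.21).
    static final String KEY = "mapreduce.local.map.tasks.maximum";

    // Assumed sizing rule: never more map threads than input files
    // or CPU cores, and always at least one.
    static int chooseMaxMaps(int inputFiles, int cores) {
        return Math.max(1, Math.min(inputFiles, cores));
    }

    // Builds the generic-option arguments "-D key=value".
    // Assumes the driver parses generic options itself.
    static List<String> genericArgs(int maxMaps) {
        return Arrays.asList("-D", KEY + "=" + maxMaps);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int maxMaps = chooseMaxMaps(4, cores); // 4 input files: made-up number
        System.out.println(String.join(" ", genericArgs(maxMaps)));
    }
}
```

The printed arguments would go in front of the application's own arguments on the command line. Setting the same key directly on the job configuration inside the driver should have the same effect.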

wlk · Accepted Answer

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.

Anyway, to set the maximum number of concurrently running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default those options are set to 2, so I might be right.

Finally, if you want to be prepared for a multinode cluster, go straight to running this in a fully distributed way, but with all the servers (namenode, datanode, tasktracker, jobtracker, ...) running on a single machine.

oae · Answer

Just for clarification: if Hadoop runs in local mode, you don't get parallel execution at the task level (unless you're running Hadoop 0.21 or later, see MAPREDUCE-1367). You can, however, submit multiple jobs at once, and these then get executed in parallel.
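Submitting several jobs at once could be sketched roughly like this. The sketch uses only the JDK: runJob is a hypothetical stand-in for configuring and running one Hadoop job (in a real driver, that is where the blocking job call would go), and splitting the work into one job per input file is an assumption, not something stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical driver, for illustration only.
class ParallelLocalJobs {
    // Stand-in for setting up and running one Hadoop job on one input file.
    // Returns true if the (pretend) job succeeded.
    static boolean runJob(String inputFile) {
        return !inputFile.isEmpty();
    }

    // Runs one job per input file on a fixed-size thread pool and
    // returns how many of them reported success.
    static int runAll(List<String> inputFiles, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String file : inputFiles) {
                Callable<Boolean> task = () -> runJob(file);
                results.add(pool.submit(task));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    succeeded++;
                }
            }
            return succeeded;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("job submission failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("part-a.txt", "part-b.txt", "part-c.txt");
        System.out.println(runAll(files, 2) + " jobs finished successfully");
    }
}
```

The pool size bounds how many jobs run concurrently, so it plays the role that a per-task thread limit would otherwise play.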

All those

mapred.tasktracker.{map|reduce}.tasks.maximum

properties apply only to Hadoop running in distributed mode!

HTH Joahnnes

Tags: java, command-line, multithreading, hadoop, mapreduce

Niels Basjes
