I ran a Hadoop MapReduce job on a 1.1 GB file multiple times with different numbers of mappers and reducers (e.g. 1 mapper and 1 reducer, 1 mapper and 2 reducers, 1 mapper and 4 reducers, ...).
Hadoop is installed on a quad-core machine with hyper-threading.
The following are the top 5 results, sorted by shortest execution time:
+----------+----------+----------+
| time | # of map | # of red |
+----------+----------+----------+
| 7m 50s | 8 | 2 |
| 8m 13s | 8 | 4 |
| 8m 16s | 8 | 8 |
| 8m 28s | 4 | 8 |
| 8m 37s | 4 | 4 |
+----------+----------+----------+
The results for 1 - 8 mappers and 1 - 8 reducers (columns = # of mappers, rows = # of reducers, times in mm:ss):
+---------+---------+---------+---------+---------+
| | 1 | 2 | 4 | 8 |
+---------+---------+---------+---------+---------+
| 1 | 16:23 | 13:17 | 11:27 | 10:19 |
+---------+---------+---------+---------+---------+
| 2 | 13:56 | 10:24 | 08:41 | 07:52 |
+---------+---------+---------+---------+---------+
| 4 | 14:12 | 10:21 | 08:37 | 08:13 |
+---------+---------+---------+---------+---------+
| 8 | 14:09 | 09:46 | 08:28 | 08:16 |
+---------+---------+---------+---------+---------+
(1) It looks like the program runs slightly faster when I have 8 mappers, but why does it slow down as I increase the number of reducers? (e.g. 8 mappers / 2 reducers is faster than 8 mappers / 8 reducers)
(2) When I use only 4 mappers, it's a bit slower simply because I'm not utilizing the other 4 cores, right?
The number of mappers depends on the total size of the input, i.e. the total number of blocks of the input files.
By default, Hadoop runs 2 mappers and 2 reducers per data node; the number of mappers can be changed in the MapReduce configuration.
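To make those knobs concrete, here is a minimal driver sketch using the standard org.apache.hadoop.mapreduce API; the class name, input path, and split sizes are illustrative assumptions, not values taken from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-sketch");
        job.setJarByClass(SplitSizeSketch.class);

        // The mapper count is not set directly: it follows from the number of
        // input splits, which depends on the block/split size of the input.
        FileInputFormat.addInputPath(job, new Path("/input/data"));   // hypothetical path

        // Smaller maximum split size -> more splits -> more mappers.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // ~64 MB per split
        // Larger minimum split size -> fewer splits -> fewer mappers:
        // FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // The reducer count, in contrast, is set explicitly per job.
        job.setNumReduceTasks(2);

        // Mapper/reducer classes, output types and the output path would be
        // configured here before calling job.waitForCompletion(true).
    }
}
```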
Suppose your data size is small; then you don't need many mappers running to process the input files in parallel. However, if the <key,value> pairs generated by the mappers are large and diverse, then it makes sense to have more reducers because you can process more <key,value> pairs in parallel.
See the Hadoop documentation on Reducer to learn more. The right number of reducers seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>). With 0.95, all of the reducers can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second wave, which does a better job of load balancing.
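As a worked example of that rule of thumb (the node and container counts below are assumptions loosely matching the single quad-core, hyper-threaded machine from the question):

```java
public class ReducerCountEstimate {
    // 0.95 / 1.75 rule of thumb: factor * (nodes * max containers per node).
    static int estimateReducers(int nodes, int maxContainersPerNode, double factor) {
        return (int) Math.floor(factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 1;              // assumed: a single machine
        int containersPerNode = 8;  // assumed: 8 logical cores / containers

        // 0.95: all reducers launch immediately and start fetching map
        // output as soon as the first maps finish.
        System.out.println(estimateReducers(nodes, containersPerNode, 0.95)); // 7

        // 1.75: faster nodes run a second wave of reducers, which improves
        // load balancing at the cost of extra task-startup overhead.
        System.out.println(estimateReducers(nodes, containersPerNode, 1.75)); // 14

        // The chosen value would then be applied with job.setNumReduceTasks(...).
    }
}
```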
The optimal number of mappers and reducers depends on many things.
The main thing to aim for is a balance between the CPU power used, the amount of data that is transported (into the mappers, between the mappers and reducers, and out of the reducers), and the disk head movements.
Each task in a MapReduce job works best if it can read/write its data with minimal disk head movements, usually described as "sequential reads/writes". But if the task is CPU bound, the extra disk head movements do not impact the job.
It seems to me that in this specific case you have
Possible ways to handle this kind of situation:
First do exactly what you did: Do some test runs and see which setting performs best given this specific job and your specific cluster.
Then you have three options:
Suggestions for shifting the load:
If CPU bound and all CPUs are fully loaded then reduce the CPU load:
If IO bound and you have some CPU capacity left: