In a typical MapReduce setup (like Hadoop), how many reducers are used for one job, for example, counting words? My understanding of MapReduce from Google is that only one reducer is involved. Is that correct?
For example, word count will divide the input into N chunks, and N map tasks will run, each producing a list of (word, count) pairs. My question is: once the map phase is done, will there be only ONE reducer instance running to compute the result, or will there be reducers running in parallel?
The short answer is that the number of reducers does not have to be 1, and yes, reducers can run in parallel. The count is either user-defined or derived, as explained below.
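Parallel reducers work because the map output is split across them by a partitioner, so every occurrence of a given word is routed to the same reducer and each reducer can total its share of the words independently. Below is a small sketch of the routing rule used by Hadoop's default HashPartitioner; the class name and the reducer count of 4 are just illustrative.

```java
// Sketch of how Hadoop's default HashPartitioner routes map output keys to
// reducers: the same word always maps to the same reducer, so each reducer
// sums a disjoint subset of the words in parallel with the others.
public class HashPartitionSketch {
  // Mirrors the logic of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.
  static int partitionFor(String word, int numReduceTasks) {
    return (word.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int reducers = 4; // illustrative reducer count
    for (String word : new String[] {"map", "reduce", "hadoop", "word", "count"}) {
      System.out.println(word + " -> reducer " + partitionFor(word, reducers));
    }
  }
}
```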
To keep things in context, I will refer to Hadoop here so you have an idea of how things work. If you are using the streaming API in Hadoop (0.20.2), you have to explicitly define how many reducers you would like to run, since by default only 1 reduce task is launched. You do so by passing -D mapred.reduce.tasks=<number of reducers> on the command line. The Java API will try to derive the number of reducers for you, but you can also set it explicitly (see the sketch below). In both cases, there is a hard cap on the number of reduce tasks that can run concurrently per node, which is set in your mapred-site.xml configuration file via mapred.tasktracker.reduce.tasks.maximum.
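To make the Java API route concrete, here is a minimal word-count driver sketch using the Hadoop 0.20 "new" API (org.apache.hadoop.mapreduce). The class names, input/output paths, and the choice of 4 reducers are all illustrative; the key line is job.setNumReduceTasks(4), the Java-API counterpart of -D mapred.reduce.tasks=4.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word; each reducer handles a disjoint set of words.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Explicitly request 4 parallel reduce tasks instead of relying on the default.
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With this setup, the output directory will contain one part file per reducer (part-r-00000 through part-r-00003 in this example), each holding the totals for the words that were partitioned to that reducer.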
On a more conceptual note, you can look at the post on the Hadoop wiki that talks about choosing the number of map and reduce tasks.