In my distributed systems course, we began discussing the MapReduce model of distributed computation. What are the benefits of having more reducers than mappers in MapReduce architectures?
Note: searching Google for this question turns up conflicting opinions on the matter.
If your data size is small, you don't need many mappers running in parallel to process the input files.
However, if the <key, value> pairs generated by the mappers are numerous and diverse (many distinct keys), it makes sense to have more reducers, because more keys can then be processed in parallel.
Let's consider a case where your mapper output has 10 keys, with 100 values associated with each key. If you have 10 reducers, you can process all the keys in parallel.
Now suppose your mappers output 100 keys with 10 values per key. Then having 100 reducers lets you process all of your keys in parallel (of course, there are network costs involved in having 100 reducers running at once). The sketch below illustrates how keys get spread across reducers in both cases.
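To make the parallelism limit concrete, here is a minimal sketch in plain Python (not tied to any particular MapReduce framework) of the hash partitioning most frameworks use to assign keys to reducers; the key names and counts are invented to mirror the two scenarios above.

```python
from collections import defaultdict

def partition(keys, num_reducers):
    """Assign each distinct mapper output key to a reducer,
    mimicking the common scheme: hash(key) % num_reducers."""
    assignment = defaultdict(list)
    for key in keys:
        assignment[hash(key) % num_reducers].append(key)
    return assignment

# Scenario 1: 10 distinct keys, 10 reducers.
# At most one key per reducer is possible, so adding reducers beyond 10
# cannot add parallelism (hash collisions may even leave some reducers idle).
few_keys = [f"key{i}" for i in range(10)]
print({r: len(ks) for r, ks in partition(few_keys, 10).items()})

# Scenario 2: 100 distinct keys, 10 reducers.
# Each reducer handles roughly 10 keys serially, so raising the reducer
# count toward 100 lets more keys be reduced in parallel.
many_keys = [f"key{i}" for i in range(100)]
print({r: len(ks) for r, ks in partition(many_keys, 10).items()})
```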
So, based on the shape of your mapper output (how many distinct keys, and how many values per key), you can decide on the optimal number of reducers.
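In practice the reducer count is something you set explicitly on the job. As one hedged illustration, here is how it might look with the mrjob library submitting a word-count job to Hadoop, where the standard mapreduce.job.reduces property controls the number of reduce tasks; the job itself is made up for the example, and local test runners simply ignore the setting.

```python
from mrjob.job import MRJob

class WordCount(MRJob):
    # Assumption: the job runs on a Hadoop cluster, where
    # mapreduce.job.reduces controls how many reduce tasks are launched.
    JOBCONF = {"mapreduce.job.reduces": "10"}

    def mapper(self, _, line):
        # Emit one <word, 1> pair per word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, key, values):
        # All counts for a given word arrive at the same reducer.
        yield key, sum(values)

if __name__ == "__main__":
    WordCount.run()
```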