Now I have a 4-phase MapReduce job as follows:
Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output
I notice that there is a ChainMapper class in Hadoop which can chain several mappers into one big mapper and save the disk I/O cost between map phases. There is also a ChainReducer class; however, it is not a real "chain reducer". It can only support jobs of the form:
[MAP+ / REDUCE MAP*]
I know I can set up four MR jobs for my task and use the default (identity) mappers for the last three jobs. But that costs a lot of disk I/O, since each reducer has to write its results to disk so that the following mapper can read them. Is there any other built-in Hadoop feature that chains my reducers and lowers the I/O cost?
I am using Hadoop 1.0.4.
If there are a lot of key-value pairs to merge, a single reducer might take too much time. To avoid the reducer machine becoming the bottleneck, we use multiple reducers. When you have multiple reducers, each node that runs a mapper partitions its key-value pairs into multiple buckets just after sorting.
You set each Reducer with setReducerClass in each job's main (driver) class.
We use the MultipleInputs class, which supports MapReduce jobs with multiple input paths, each with a different InputFormat and Mapper.
The default number of reducers for any job is 1. The number of reducers can be set in the job configuration.
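For concreteness, here is a minimal driver sketch that puts these pieces together: setReducerClass, MultipleInputs.addInputPath for two input paths with different mappers, and setNumReduceTasks to override the default of 1. LogMapper, CsvMapper, MergeReducer, and the paths are hypothetical placeholders; also note that the org.apache.hadoop.mapreduce version of MultipleInputs shown here only ships with newer Hadoop releases, so on 1.x you may need the older org.apache.hadoop.mapred.lib.MultipleInputs instead.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "merge-job");
        job.setJarByClass(MergeDriver.class);

        // Each input path gets its own InputFormat and Mapper.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LogMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, CsvMapper.class);

        // One reducer class per job; raise the count from the default of 1
        // so a single reducer machine does not become the bottleneck.
        job.setReducerClass(MergeReducer.class);
        job.setNumReduceTasks(4);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```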
I don't think you can feed the output of one reducer directly into another reducer. I would have gone for this:
Input-> Map1 -> Reduce1 ->
Identity mapper -> Reduce2 ->
Identity mapper -> Reduce3 ->
Identity mapper -> Reduce4 -> Output
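A rough sketch of the first two links of that chain, assuming hypothetical Map1, Reduce1, and Reduce2 classes: job 1 writes its reducer output as a SequenceFile, and job 2 reads it back with the built-in identity Mapper before Reduce2. Phases 3 and 4 would repeat the same pattern, each paying the disk round trip the question is worried about.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Phase 1: Map1 -> Reduce1, intermediate output kept as a SequenceFile
        // so the next job preserves the (Text, Text) key-value types.
        Job job1 = new Job(conf, "phase-1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(Map1.class);
        job1.setReducerClass(Reduce1.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("tmp/phase1"));
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Phase 2: identity Mapper simply passes records through to Reduce2.
        Job job2 = new Job(conf, "phase-2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(Mapper.class); // the base Mapper is an identity map
        job2.setReducerClass(Reduce2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, new Path("tmp/phase1"));
        FileOutputFormat.setOutputPath(job2, new Path("tmp/phase2"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```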
In the Hadoop 2.x series, you can chain mappers before the reducer with ChainMapper and chain mappers after the reducer with ChainReducer, all within a single job.
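A minimal sketch of that [MAP+ / REDUCE MAP*] layout with the Hadoop 2.x org.apache.hadoop.mapreduce.lib.chain classes; TokenMapper, UpperCaseMapper, SumReducer, and FilterMapper are hypothetical placeholders for your own classes. The chained mappers run inside the same map or reduce task, so records are passed along in memory rather than through HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chained-job");
        job.setJarByClass(ChainedJobDriver.class);

        // MAP+: any number of mappers before the single reduce phase.
        ChainMapper.addMapper(job, TokenMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainMapper.addMapper(job, UpperCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // Exactly one reducer...
        ChainReducer.setReducer(job, SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));
        // ...followed by MAP*: zero or more mappers chained after the reducer.
        ChainReducer.addMapper(job, FilterMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```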