
Can I get individually sorted Mapper outputs from Hadoop when using zero Reducers?

I have a job in Hadoop 0.20 that needs to operate on large files, one at a time. (It's a pre-processing step to get file-oriented data into a cleaner, line-based format more suitable for MapReduce.)

I don't mind how many output files I have, but each Map's output can be in at most one output file, and each output file must be sorted.

  • If I run with numReducers=0, it runs quickly, and each Mapper writes out its own output file which is fine - but the files aren't sorted.
  • If I add one reducer (plain Reducer.class) this adds an unnecessary global sort step to a single file, which takes many hours (much longer than the Map tasks take).
  • If I add multiple reducers, the results of individual map jobs are mixed together so one Map's output ends up in multiple files.

Is there any way to persuade Hadoop to perform a map-side sort on the output of each map task without using Reducers, or is there any other way of skipping the slow global merge?

Ben Moran asked Jun 25 '10 12:06




1 Answer

One way of doing global sorting is to use a custom partitioner that range-partitions keys across your reducers. For this to work you have to know the range of your mapper output keys. You divide the key range into n buckets, where n is the number of reducers. Depending on which bucket a key falls into, the mapper output is routed to a specific reducer.
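The bucketing step described above can be sketched in plain Java. This is only an illustration under stated assumptions: keys are taken to be `long` values in a known range, and the class name `RangePartitionSketch` and method `bucketFor` are hypothetical. In a real job, logic like this would presumably sit inside the `getPartition` method of a custom Hadoop `Partitioner`; the sketch leaves Hadoop out so that it runs on its own.

```java
// Sketch only: plain Java with no Hadoop dependency.
// RangePartitionSketch and bucketFor are hypothetical names used for illustration.
final class RangePartitionSketch {

    // Map a key in [minKey, maxKey] to one of numReducers equal-width buckets.
    // Assumes minKey and maxKey are known in advance, and that the range is
    // small enough that the long arithmetic below does not overflow.
    static int bucketFor(long key, long minKey, long maxKey, int numReducers) {
        if (numReducers <= 0 || maxKey < minKey) {
            throw new IllegalArgumentException("invalid range or reducer count");
        }
        // Clamp out-of-range keys so they land in the first or last bucket.
        long clamped = Math.max(minKey, Math.min(maxKey, key));
        long span = maxKey - minKey + 1;   // number of distinct keys in the range
        long offset = clamped - minKey;    // 0 .. span - 1
        // The result never decreases as the key grows, so every key in
        // bucket i is smaller than every key in bucket i + 1.
        return (int) (offset * numReducers / span);
    }

    public static void main(String[] args) {
        // Keys 0..99 split across 4 reducers.
        for (long key : new long[] {0, 24, 25, 50, 99}) {
            System.out.println(key + " -> reducer " + bucketFor(key, 0, 99, 4));
        }
    }
}
```

Because the bucket index is monotonic in the key, each reducer receives a contiguous slice of the key space, which is what makes the concatenated outputs globally sorted.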

The output of each reducer is sorted, and because of the range partitioning the reducer outputs taken together are globally sorted. All you have to do is read the reducer output files in the order of the five-digit partition number in their file names (for example part-00000, part-00001, and so on).

One thing to watch out for is skew in your key distribution, which will result in uneven reducer load across the cluster. This problem can be alleviated if you have distribution information, i.e., a histogram of the keys. You could then make the buckets unequal in width so that each one holds approximately the same number of keys.
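A rough sketch of the unequal-bucket idea, assuming a sample of keys is available as a stand-in for the histogram. The class `SkewAwarePartitionSketch`, its methods, and the sample values are hypothetical and only illustrate the technique: split points are taken at evenly spaced positions in the sorted sample, and a key is routed by binary search over those split points.

```java
import java.util.Arrays;

// Sketch only: plain Java with no Hadoop dependency.
// SkewAwarePartitionSketch and its methods are hypothetical names.
final class SkewAwarePartitionSketch {

    // Choose numReducers - 1 split points from a sample of keys so that each
    // bucket receives roughly the same number of sampled keys.
    static long[] splitPoints(long[] sampleKeys, int numReducers) {
        if (numReducers <= 0 || sampleKeys.length == 0) {
            throw new IllegalArgumentException("need a non-empty sample and at least one reducer");
        }
        long[] sorted = sampleKeys.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            // Position of the i-th quantile boundary in the sorted sample.
            int idx = (int) ((long) i * sorted.length / numReducers);
            splits[i - 1] = sorted[idx];
        }
        return splits;
    }

    // Keys below splits[0] go to bucket 0; keys that are >= splits[i - 1]
    // and < splits[i] go to bucket i; keys >= the last split go to the last bucket.
    static int bucketFor(long key, long[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        // A skewed sample: most keys are small, a few are very large.
        long[] sample = {1, 2, 3, 4, 5, 6, 7, 8, 50, 100, 500, 1000};
        long[] splits = splitPoints(sample, 4);
        System.out.println("split points: " + Arrays.toString(splits));
        for (long key : sample) {
            System.out.println(key + " -> reducer " + bucketFor(key, splits));
        }
    }
}
```

With equal-width buckets over 1 to 1000, nearly all of these sample keys would land on the first reducer; with split points taken from the sample, each of the four buckets receives three of the twelve keys.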

Hope it helps.

Pranab answered Sep 21 '22 14:09
