I have a job in Hadoop 0.20 that needs to operate on large files, one at a time. (It's a pre-processing step to get file-oriented data into a cleaner, line-based format more suitable for MapReduce.)
I don't mind how many output files I have, but each Map's output can be in at most one output file, and each output file must be sorted.
Is there any way to persuade Hadoop to perform a map-side sort on the output of each Map, without using Reducers, or any other way of skipping the slow global merge?
If we set the number of Reducers to 0 (by calling job.setNumReduceTasks(0)), then no Reducer will execute and no aggregation will take place. In that case we get what is called a "map-only job" in Hadoop: each Mapper does all the work on its InputSplit, and no Reducer runs.
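For reference, here is a minimal sketch of such a map-only job using the new (org.apache.hadoop.mapreduce) API in 0.20; PassThroughMapper is a hypothetical stand-in for the real pre-processing Mapper:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

    // Stand-in mapper: emits each input line unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only pre-processing");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only: output goes straight to HDFS, unsorted
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}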
Yes, we can set the number of Reducers to zero, which makes the job map-only; the output is then not sorted and is written directly to HDFS. If we want the Mapper output to be sorted, we can use an identity Reducer.
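In the new API the base Reducer class already behaves as an identity Reducer (with the old org.apache.hadoop.mapred API you would use org.apache.hadoop.mapred.lib.IdentityReducer instead), so a sketch, assuming a Job configured as usual, is just:

job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class); // identity: passes every pair through
job.setNumReduceTasks(4); // hypothetical count; each Reducer writes one output file sorted by key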
The keys emitted by the Mapper are automatically sorted by the MapReduce framework: before the Reducer starts, all intermediate key-value pairs generated by the Mapper get sorted by key, not by value. The values passed to each Reducer are not sorted; they can be in any order.
The output of the Reducer is not re-sorted. (In the Reducer API, cleanup() is called once at the end of the task, and reduce() is called once for each key.)
One way of doing a global sort is to use a custom Partitioner that range-partitions the keys across your Reducers. For this to work you have to know the range of your Mapper output keys. Divide the key range into n buckets, where n is the number of Reducers; depending on which bucket a key falls into, the Mapper output gets routed to a specific Reducer, as in the sketch below.
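Here is a minimal sketch of such a Partitioner, assuming integer keys whose upper bound is known in advance (MAX_KEY is a hypothetical value); it would be registered with job.setPartitionerClass(RangePartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RangePartitioner extends Partitioner<IntWritable, Text> {
    private static final long MAX_KEY = 1000000L; // assumed known key range [0, MAX_KEY)

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // Divide [0, MAX_KEY) into numPartitions equal-width buckets, so
        // every key sent to Reducer i is smaller than any key sent to i+1.
        int bucket = (int) ((long) key.get() * numPartitions / MAX_KEY);
        return Math.min(Math.max(bucket, 0), numPartitions - 1); // clamp out-of-range keys
    }
}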
The output of each Reducer is sorted, and the collection of all Reducer outputs is globally sorted because of the range partitioning. All you have to do is read the Reducer output files in the order given by the five-digit suffix in their names (part-00000, part-00001, and so on).
One thing to watch out for is skew in your key distribution, which will result in uneven Reducer load across the cluster. This can be alleviated if you have distribution information, i.e. a histogram of the keys: you can then make the bucket widths unequal, with each bucket holding approximately the same number of keys.
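As a sketch of that idea, the equal-width bucketing above can be replaced by precomputed split points; the values below are hypothetical and would come from an actual histogram of the key distribution, chosen so each bucket holds roughly the same number of keys:

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HistogramRangePartitioner extends Partitioner<IntWritable, Text> {
    // Upper bounds (inclusive) of buckets 0..2; bucket 3 takes everything
    // larger, so this partitioner expects numPartitions == 4.
    private static final int[] SPLIT_POINTS = {100, 5000, 250000};

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        int idx = Arrays.binarySearch(SPLIT_POINTS, key.get());
        int bucket = (idx >= 0) ? idx : -idx - 1; // first split point >= key
        return Math.min(bucket, numPartitions - 1);
    }
}

Hadoop also ships a sampler-driven version of this idea: TotalOrderPartitioner together with InputSampler can derive the split points for you (in 0.20 they live in the old org.apache.hadoop.mapred.lib package, if I recall correctly).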
Hope it helps.