I can see in my MapReduce jobs that the output of the reducer is sorted by key. So if I set the number of reducers to 10, the output directory contains 10 files, and each of those files holds sorted data.
What puzzles me is that although each file is sorted internally, the files themselves are not sorted relative to each other. For example, there are scenarios where a single part-000* file starts at "0" and ends at "zzzz" (assuming I am using Text as the key).
I was assuming the data should also be sorted across files, i.e. the first file (part-00000) should start with "a" and the last file (part-00009) should hold entries near "zzzz", or at least keys greater than those in the first file, assuming the keys are uniformly distributed over the alphabet.
Could someone shed some light on why this behavior occurs?
The output of the reducers is not re-sorted globally. Within each reducer, the reduce() method is called once for each key, in sorted key order, which is why every individual output file is sorted.
Sorting is one of the basic operations MapReduce performs to process and analyze data. The framework automatically sorts the intermediate key-value pairs emitted by the mappers by key before they reach each reducer; this sorting happens during the shuffle phase, not in the mapper class itself, and it is only per-reducer, not across reducers.
In a MapReduce job the Mapper executes first; its output is then assigned to a reducer by the Partitioner, optionally pre-aggregated by the Combiner, and finally shuffled and sorted before reaching the Reducer.
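The partitioning step is what explains the behavior in the question. The default partitioner routes each key by its hash, so adjacent keys land in unrelated part files. A minimal plain-Java sketch of that formula (the real class is `org.apache.hadoop.mapreduce.lib.partition.HashPartitioner`; the demo class name here is made up):

```java
// Sketch of how Hadoop's default HashPartitioner picks a reducer for a key.
// Because the assignment is hash-based, there is no range ordering between
// part files: "a" and "zzzz" can land in any partition.
public class HashPartitionDemo {

    // Same formula HashPartitioner uses: clear the sign bit, then take
    // the remainder modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "apple", "zebra", "zzzz"};
        for (String k : keys) {
            // With 10 reducers, neighboring keys scatter across partitions.
            System.out.println(k + " -> part-0000" + getPartition(k, 10));
        }
    }
}
```

Running this shows, for instance, that "a" and "zzzz" end up in different, non-adjacent partitions purely by hash value, which is exactly the scattering observed in the question.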
You can achieve a globally sorted output (which is what you basically want) using one of these methods:
Write a custom partitioner. The Partitioner is the class that divides the key space among the reducers. The default partitioner (HashPartitioner) spreads keys evenly across reducers by hashing them, which is precisely why the output files are not ordered relative to each other. A range-based partitioner sends a contiguous slice of the key space to each reducer instead; Hadoop also ships TotalOrderPartitioner for exactly this purpose.
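A minimal sketch of the range-based idea, in plain Java so it runs standalone. In a real job you would extend `org.apache.hadoop.mapreduce.Partitioner` (or configure `TotalOrderPartitioner`); the class and bucketing scheme below are illustrative assumptions:

```java
// Hypothetical range partitioner: buckets keys by their first letter so
// that partition i receives a contiguous, ordered slice of the key space.
// With this scheme, part-00000 .. part-00009 are globally ordered:
// concatenating them in order yields a fully sorted data set.
public class RangePartitionDemo {

    static int getPartition(String key, int numReduceTasks) {
        if (key.isEmpty()) return 0;
        char c = Character.toLowerCase(key.charAt(0));
        // Keys before 'a' go to the first partition, after 'z' to the last.
        if (c < 'a') return 0;
        if (c > 'z') return numReduceTasks - 1;
        // Split the 26 letters evenly over the available reducers.
        return (c - 'a') * numReduceTasks / 26;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "mango", "zebra"};
        for (String k : keys) {
            // Unlike the hash partitioner, larger keys map to larger
            // partition numbers, so the part files sort end-to-end.
            System.out.println(k + " -> part-0000" + getPartition(k, 10));
        }
    }
}
```

Note the trade-off: a range partitioner only balances the load if the keys really are uniformly distributed; for skewed data, TotalOrderPartitioner samples the input first to pick balanced split points.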
Use Hadoop Pig or Hive to do the sort.