I can see in my MapReduce jobs that the output of the reducer is sorted by key. So if I set the number of reducers to 10, the output directory contains 10 files, and each of those files holds sorted data.
What puzzles me is that although each file is sorted internally, the files themselves are not sorted relative to each other. For example, there are scenarios where a single part-000* file starts at "0" and ends at "zzzz" (assuming I am using Text as the key).
I was assuming the data should also be sorted across files, i.e. the first file (part-00000) should start with "a" and the last file (part-00009) should hold entries near "zzzz", or at least keys greater than those in the first file, assuming the keys are uniformly distributed over the alphabet.
Could someone shed some light on why this behavior occurs?
The output of the reducers is not re-sorted globally. Within each reducer, the reduce() method is called once for each key, in sorted key order, which is why every individual output file is sorted.
Sorting is one of the basic operations MapReduce performs to process and analyze data. The framework automatically sorts the intermediate key-value pairs emitted by the mappers by key before they reach each reducer; this sorting happens during the shuffle phase, not in the mapper class itself, and it is only per-reducer, not across reducers.
In a MapReduce job the Mapper executes first; its output is then assigned to a reducer by the Partitioner, optionally pre-aggregated by the Combiner, and finally shuffled and sorted before reaching the Reducer.
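The partitioning step is what explains the behavior in the question. The default partitioner routes each key by its hash, so adjacent keys land in unrelated part files. A minimal plain-Java sketch of that formula (the real class is `org.apache.hadoop.mapreduce.lib.partition.HashPartitioner`; the demo class name here is made up):

```java
// Sketch of how Hadoop's default HashPartitioner picks a reducer for a key.
// Because the assignment is hash-based, there is no range ordering between
// part files: "a" and "zzzz" can land in any partition.
public class HashPartitionDemo {

    // Same formula HashPartitioner uses: clear the sign bit, then take
    // the remainder modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "apple", "zebra", "zzzz"};
        for (String k : keys) {
            // With 10 reducers, neighboring keys scatter across partitions.
            System.out.println(k + " -> part-0000" + getPartition(k, 10));
        }
    }
}
```

Running this shows, for instance, that "a" and "zzzz" end up in different, non-adjacent partitions purely by hash value, which is exactly the scattering observed in the question.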
You can achieve a globally sorted output (which is what you basically want) using one of these methods:
Write a custom partitioner. The Partitioner is the class that divides the key space among the reducers. The default partitioner (HashPartitioner) spreads keys evenly across reducers by hashing them, which is precisely why the output files are not ordered relative to each other. A range-based partitioner sends a contiguous slice of the key space to each reducer instead; Hadoop also ships TotalOrderPartitioner for exactly this purpose.
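A minimal sketch of the range-based idea, in plain Java so it runs standalone. In a real job you would extend `org.apache.hadoop.mapreduce.Partitioner` (or configure `TotalOrderPartitioner`); the class and bucketing scheme below are illustrative assumptions:

```java
// Hypothetical range partitioner: buckets keys by their first letter so
// that partition i receives a contiguous, ordered slice of the key space.
// With this scheme, part-00000 .. part-00009 are globally ordered:
// concatenating them in order yields a fully sorted data set.
public class RangePartitionDemo {

    static int getPartition(String key, int numReduceTasks) {
        if (key.isEmpty()) return 0;
        char c = Character.toLowerCase(key.charAt(0));
        // Keys before 'a' go to the first partition, after 'z' to the last.
        if (c < 'a') return 0;
        if (c > 'z') return numReduceTasks - 1;
        // Split the 26 letters evenly over the available reducers.
        return (c - 'a') * numReduceTasks / 26;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "mango", "zebra"};
        for (String k : keys) {
            // Unlike the hash partitioner, larger keys map to larger
            // partition numbers, so the part files sort end-to-end.
            System.out.println(k + " -> part-0000" + getPartition(k, 10));
        }
    }
}
```

Note the trade-off: a range partitioner only balances the load if the keys really are uniformly distributed; for skewed data, TotalOrderPartitioner samples the input first to pick balanced split points.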
Use Hadoop Pig or Hive to do the sort.