 

Where is Sort used in MapReduce phase and why?

I am new to Hadoop. It is not clear to me why we need to sort by keys in Hadoop MapReduce. After the map phase, we need to distribute the data corresponding to each unique key to some number of reducers. This can be done without having to sort it, right?

asked Jul 31 '12 by user428900

People also ask

Why do we need to sort in MapReduce?

Sorting in Hadoop helps the reducer easily distinguish when a new reduce call should start, which saves time for the reducer: it starts a new reduce call when the next key in the sorted input differs from the previous one.

Where is sorting done in MapReduce?

Sorting is one of the basic MapReduce algorithms used to process and analyze data. The MapReduce framework automatically sorts the output key-value pairs from the mapper by their keys; this sorting is performed on the map side, before the data reaches the reducers.

What is sort in MapReduce?

Sorting: the keys generated by the mapper are automatically sorted by the MapReduce framework, i.e. before the reducer starts, all intermediate key-value pairs generated by the mapper are sorted by key, not by value.

What is the main purpose of the shuffle and sort phase in the MapReduce programming paradigm?

Your question: what is the purpose of the shuffle and sort phase before the reducer in MapReduce programming? Short answer: to arrange the data so that the desired output can be produced. Shuffling aggregates the intermediate data by key; the reduce phase then produces the expected output.
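To make the point from the first "People also ask" answer above concrete ("a new reduce call starts when the key changes"), here is a rough sketch in plain Java (not the actual Hadoop API) of how sorted input lets grouping happen in a single pass, with only one key comparison per record:

    import java.util.List;
    import java.util.Map;

    public class SortedGrouping {
        public static void main(String[] args) {
            // Intermediate (key, value) pairs as they would arrive at a reducer,
            // already sorted by key by the framework.
            List<Map.Entry<String, Integer>> sorted = List.of(
                Map.entry("apple", 1), Map.entry("apple", 1),
                Map.entry("banana", 1), Map.entry("cherry", 1)
            );

            String currentKey = null;
            int sum = 0;
            for (Map.Entry<String, Integer> e : sorted) {
                if (currentKey != null && !currentKey.equals(e.getKey())) {
                    // Key changed: the previous group is complete, emit it.
                    System.out.println(currentKey + "\t" + sum);
                    sum = 0;
                }
                currentKey = e.getKey();
                sum += e.getValue();
            }
            if (currentKey != null) {
                System.out.println(currentKey + "\t" + sum); // last group
            }
        }
    }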


2 Answers

It is there because sorting is a neat trick to group your keys. Of course, if your job or algorithm does not need any ordering of your keys, then it will be faster to group them by some hashing trick.
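A minimal sketch of that "hashing trick", in plain Java rather than Hadoop code: when the order of keys does not matter, grouping with a hash map avoids the cost of a sort entirely.

    import java.util.HashMap;
    import java.util.Map;

    public class HashGrouping {
        public static void main(String[] args) {
            String[] words = {"apple", "banana", "apple", "cherry", "banana", "apple"};

            // Group and aggregate without sorting: each key is bucketed by its hash.
            Map<String, Integer> counts = new HashMap<>();
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
            }

            // Note: iteration order is unspecified, unlike the sorted output
            // a MapReduce job would produce.
            counts.forEach((word, count) -> System.out.println(word + "\t" + count));
        }
    }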

In Hadoop itself, a JIRA has been filed for this for years (source). Several other distributions that layer on top of Hadoop already have this feature, Hanborq for example (they call it sort avoidance). (source)

As for your actual question (why): MapReduce was originally described in a paper from Google (source), which states the following:

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.

So supporting sort was largely a convenience decision; sorting is not inherently the only way to group keys.

answered Dec 06 '22 by Thomas Jungblut


The "sort by key" is best understood if we consider the fact that hadoop DISTRIBUTES processes for you by sending different keys to different machines. The basic (simplified) version of the idea is this :

The reducer which a (k,v) pair is sent to = k.hashCode()%num_of_machines. 

So, if my key's hashcode is 10, and I have 2 machines, the key will be sent to machine #0, for example.
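This hashCode-modulo routing is essentially what Hadoop's default HashPartitioner does. Here is a small sketch of an equivalent custom partitioner against the standard org.apache.hadoop.mapreduce API (the class name ModuloPartitioner and the Text/IntWritable key/value types are just illustrative choices):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route a key to a reducer based on its hash code, so that equal keys
    // always land on the same machine.
    public class ModuloPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so a negative hashCode cannot yield a negative partition.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }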

Thus, the key (first) gives us a simple way to distribute computation.

In addition to simplifying distribution of computation, keys give us a way to join records from disparate data files into a single group. This is how we can do things like word_count, for example.

In fact, if you find that you don't need keys, you probably don't need Hadoop either!

The classic example (word count):

In the Hadoop "word count" example, we emit keys (one key = one word) with values (the number of times that word was seen in a segment of text). This allows a SINGLE reduce function to receive a SINGLE word and add up all the times it was seen, producing an accurate word count.

Thus, aggregation by key is what allows the "map" phase to be distributed across multiple machines independently. Without aggregating keys to the same reducer, in the word count example we might get several partial counts for a given word, since there is no guarantee that a single reducer would receive all counts for that word from all files.
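For reference, this is roughly the standard Hadoop WordCount mapper and reducer (as in the Hadoop tutorial, driver code omitted). The framework sorts and groups the emitted words between the two phases, so each reduce() call sees every count for exactly one word:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);   // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Because the framework grouped all values for this key together,
                // a single reduce call sees every occurrence of the word.
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }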

Another example:

Now... let's say we have social security numbers as ids and we want to output an aggregation of personal data. Let's say we have 2 massive files:

ssn->name

ssn->shoe_size

In this case, we can leverage the power of key grouping, such that an individual's name and shoe size are BOTH sent to the SAME reduce function.

The reducer will then receive 2 records for a given ssn and can join them into one output record:

ssn->name,shoe_size

The idea here is that when writing map/reduce jobs, you have to encode the "tuples" you output in such a way that they can be joined together in a meaningful way in the reduce phase. Any distributed computing environment will likely, at some point, need to combine records calculated on different nodes. Keys give us a convenient and scalable methodology for doing this.

So -- the fact that we are guaranteed that the SAME keys go to the SAME reducer function means that EACH reducer handling a particular social security number will receive ALL of the data associated with that number, allowing us to join and output data records which include ssn, name, and shoe size.
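A sketch of how such a reduce-side join could look with the standard Hadoop API. The class names (SsnJoin, NameMapper, ShoeSizeMapper, JoinReducer), the "NAME:"/"SHOE:" tags, and the assumption that each input line looks like "ssn,value" are all made up here purely for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SsnJoin {

        // Reads lines like "123-45-6789,Jane" and emits (ssn, "NAME:Jane").
        public static class NameMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split(",", 2);
                context.write(new Text(parts[0]), new Text("NAME:" + parts[1]));
            }
        }

        // Reads lines like "123-45-6789,9.5" and emits (ssn, "SHOE:9.5").
        public static class ShoeSizeMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split(",", 2);
                context.write(new Text(parts[0]), new Text("SHOE:" + parts[1]));
            }
        }

        public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text ssn, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Both record types for this ssn arrive at the same reduce call,
                // so joining them is a simple in-memory combination.
                String name = "";
                List<String> shoeSizes = new ArrayList<>();
                for (Text v : values) {
                    String s = v.toString();
                    if (s.startsWith("NAME:")) {
                        name = s.substring(5);
                    } else if (s.startsWith("SHOE:")) {
                        shoeSizes.add(s.substring(5));
                    }
                }
                for (String size : shoeSizes) {
                    context.write(ssn, new Text(name + "," + size));
                }
            }
        }
    }

In a real job, the driver would typically attach each mapper to its own input file (for example with MultipleInputs.addInputPath); that wiring is omitted here.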

Conclusion

Without distributing by key, joining data in such a way would require painfully complex logic involving some kind of intermediary data storage / caching. Hadoop simply generalizes and abstracts the common need to "join" data results from parallel computations by using a familiar paradigm: keys and values.

answered Dec 06 '22 by jayunit100