How does the MapReduce sort algorithm work?

People also ask

How does sort algorithm work?

Sorting algorithms are a set of instructions that take an array or list as input and arrange the items into a particular order. The order is most commonly numerical or alphabetical (lexicographical), and can be ascending (A-Z, 0-9) or descending (Z-A, 9-0).

How does MapReduce sort and shuffle work?

What is MapReduce shuffling and sorting? Shuffling is the process of transferring the mappers' intermediate output to the reducers. Each reducer receives one or more keys and their associated values, depending on the number of reducers. The intermediate key-value pairs generated by the mappers are sorted automatically by key.
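
As a rough illustration (a toy Python sketch, not Hadoop's actual implementation), the shuffle-and-sort step can be pictured as sorting the mappers' key-value pairs by key and grouping them so each reduce call sees one key with all of its values:

    from itertools import groupby
    from operator import itemgetter

    # Intermediate (key, value) pairs as several mappers might emit them.
    mapped = [("b", 1), ("a", 2), ("b", 3), ("a", 1)]

    # Shuffle and sort: order by key, then group so each reduce call
    # sees one key together with all of its values.
    mapped.sort(key=itemgetter(0))
    for key, pairs in groupby(mapped, key=itemgetter(0)):
        values = [v for _, v in pairs]
        print(key, sum(values))  # word-count-style reduce: a 3, b 4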

What is a MapReduce algorithm?

MapReduce is a distributed data-processing algorithm introduced by Google, inspired mainly by the functional programming model. It is useful for processing huge amounts of data in a parallel, reliable, and efficient way in cluster environments.


Here are some details on Hadoop's implementation for TeraSort:

"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the outputs of reduce i are all less than the outputs of reduce i + 1."

So their trick is in the way they determine the partition for each key during the map phase. Essentially, they ensure that every key sent to a single reducer is guaranteed to be 'pre-sorted' relative to the keys sent to all other reducers.
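
Here is a minimal sketch of that idea in Python (Hadoop's TeraSort is Java, and the sample keys below are made up): binary-searching each key against the sorted sample picks its reducer, so everything sent to reduce i sorts before everything sent to reduce i + 1:

    import bisect

    # N - 1 sampled keys define the key range of each of N = 4 reducers
    # (hypothetical sample; TeraSort obtains it by sampling the input).
    sample = ["f", "m", "t"]

    def partition(key):
        # Reduce i gets all keys with sample[i - 1] <= key < sample[i];
        # keys before sample[0] go to reduce 0, keys >= sample[-1] to the last.
        return bisect.bisect_right(sample, key)

    for key in ["apple", "fig", "mango", "zebra"]:
        print(key, "-> reduce", partition(key))
    # apple -> 0, fig -> 1, mango -> 2, zebra -> 3

Because every reducer's input range is disjoint and ordered, simply concatenating the reducers' individually sorted outputs yields a globally sorted result.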

I found the paper reference through James Hamilton's Blog Post.


Google Reference: MapReduce: Simplified Data Processing on Large Clusters

Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.

That link has references to the PDF and the HTML slides.

There is also a Wikipedia page with a description and implementation references.

There is also criticism:

David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared-nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level and questioned whether it really represents the paradigm shift its proponents have claimed. They challenged the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades, and compared MapReduce programmers to Codasyl programmers, noting that both are "writing in a low-level language performing low-level record manipulation". MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by common database-system features such as B-trees and hash partitioning, though projects such as Pig Latin and Sawzall are starting to address these problems.


I had the same question while reading Google's MapReduce paper. @Yuval F's answer pretty much solved my puzzle.

One thing I noticed while reading the paper is that the magic happens in the partitioning (after map, before reduce).

The paper uses hash(key) mod R as the partitioning example, but this is not the only way to partition intermediate data to different reduce tasks.
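
For comparison, here is a sketch of that default scheme (using a stable hash so the assignment is deterministic across runs; the paper itself does not specify a hash function):

    from zlib import crc32

    R = 4  # number of reduce tasks

    def hash_partition(key):
        # hash(key) mod R balances load well, but it scatters the key
        # order across reducers, so the combined output is not sorted.
        return crc32(key.encode()) % R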

Just add two boundary conditions to @Yuval F's answer to make it complete: let min(S) and max(S) be the minimum and maximum keys among the sampled keys; all keys < min(S) are partitioned to the first reduce task, and likewise all keys >= max(S) are partitioned to the last reduce task.

There is no hard requirement on the sampled keys, such as containing the true min or max. Simply, the more evenly these R keys are distributed among all the keys, the more "parallel" the distributed system is, and the less likely a reduce operator is to run into a memory overflow issue.
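
One simple way to pick evenly spread split points, assuming the sample is representative of the full key set (a sketch with a made-up sample):

    def split_points(sampled_keys, R):
        # Take R - 1 evenly spaced keys from the sorted sample so each
        # reducer's range covers roughly the same share of the data.
        s = sorted(sampled_keys)
        step = len(s) / R
        return [s[int(step * i)] for i in range(1, R)]

    print(split_points(["q", "c", "x", "a", "m", "f", "t", "j"], R=4))
    # ['f', 'm', 't'] for this toy sample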