
Shuffle and sort for MapReduce

I have read through the Definitive Guide and some other links on the web, including the one here.

My question is: where exactly do shuffling and sorting happen?

As per my understanding, they happen on both mappers and reducers. But some links mention that shuffling happens on mappers and sorting on reducers.

Can someone confirm whether my understanding is correct? If not, can you point me to additional documentation I can go through?

asked Sep 18 '16 21:09 by red




1 Answer

Shuffle:

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.
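To illustrate why that sorted-input guarantee matters, here is a minimal Python sketch (not Hadoop code). The names `run_reducer` and `reducer_input` are hypothetical and exist only for this example. It assumes the reducer's input has already been sorted by key, so all values for one key are adjacent and can be grouped in a single pass.

```python
from itertools import groupby
from operator import itemgetter


def run_reducer(sorted_input, reduce_fn):
    """Call reduce_fn once per key, relying on the input being sorted by key."""
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        # groupby only groups *adjacent* equal keys, hence the sort requirement.
        for key, group in groupby(sorted_input, key=itemgetter(0))
    ]


# Hypothetical reducer input, already sorted by key as the shuffle guarantees.
reducer_input = [("apple", 1), ("apple", 1), ("fig", 1), ("pear", 1)]
counts = run_reducer(reducer_input, lambda key, values: sum(values))
```

If the input were not sorted, the same key could show up in several separate groups, which is why the framework sorts before calling the reducer.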

Sort:

Sorting happens at several stages of a MapReduce program, so it exists in both the map and reduce phases.

Please have a look at this diagram: [image]

Here is more detail on the map and reduce phases shown in the image above.

The Map Side:

When the map function starts producing output, it is not simply written to disk. The output is first buffered in memory, and a background thread spills the buffer to disk once it fills past a threshold. Before writing to disk, that thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, it performs an in-memory sort by key.
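The map-side steps described above (partition first, then sort within each partition) can be sketched in plain Python. This is a simplified simulation, not Hadoop's implementation: `partition_for` and `spill` are hypothetical names, and the CRC32-based hash is just a stand-in for a hash partitioner.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner: stable hash modulo reducer count.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def spill(buffered_records, num_reducers):
    """Partition buffered (key, value) pairs, then sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in buffered_records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    # In-memory sort by key within each partition.
    return {
        p: sorted(records, key=lambda kv: kv[0])
        for p, records in partitions.items()
    }


# Hypothetical map output sitting in the in-memory buffer.
map_output = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("kiwi", 1)]
spilled = spill(map_output, num_reducers=2)
```

Each entry in `spilled` corresponds to one reducer's share of this map task's output, already sorted by key, and every occurrence of a given key lands in the same partition.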

The Reduce Side:

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This will be done in rounds.
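A rough Python sketch of that merge phase follows. It assumes each copied map output is already sorted by key and merges a limited number of runs per round. `merge_in_rounds` and `merge_factor` are hypothetical names for this illustration; Hadoop's real merge uses a configurable merge factor and a different round-planning strategy.

```python
import heapq


def merge_in_rounds(sorted_runs, merge_factor=2):
    """Merge sorted runs, at most merge_factor at a time, until one run remains."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    runs = [list(run) for run in sorted_runs]
    rounds = 0
    while len(runs) > 1:
        # heapq.merge preserves sort order without re-sorting the inputs.
        runs = [
            list(heapq.merge(*runs[i:i + merge_factor], key=lambda kv: kv[0]))
            for i in range(0, len(runs), merge_factor)
        ]
        rounds += 1
    return (runs[0] if runs else []), rounds


# Hypothetical sorted outputs copied from three map tasks.
run_a = [("apple", 1), ("pear", 1)]
run_b = [("apple", 1), ("fig", 1)]
run_c = [("kiwi", 1)]
merged, rounds = merge_in_rounds([run_a, run_b, run_c], merge_factor=2)
```

No full sort happens here: the reduce side only interleaves runs that the map side already sorted, which is why the guide says the phase is more accurately a merge.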

Source: Hadoop: The Definitive Guide.

answered Oct 26 '22 20:10 by mrsrinivas