Shuffle and sort for MapReduce

1 Answer

Shuffle:

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.

Sort:

Sorting happens at various stages of a MapReduce program, so it can occur in both the Map and Reduce phases.

Please have a look at this diagram: [image]

Here is more detail on what the image shows for the Map and Reduce phases.

The Map Side:

When the map function starts producing output, it is not simply written to disk. The output is first buffered in memory, and a background thread spills the buffer to disk once it fills past a threshold. Before writing to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key.
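To make the partition-then-sort step concrete, here is a minimal Python sketch of what happens to one buffer of map output. It is only an illustration, not Hadoop code: `partition_for` and `spill` are hypothetical names, and `zlib.crc32` is a stand-in for the key hash (Hadoop's default partitioner works on the same "hash of the key modulo the number of reducers" idea).

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    # Hypothetical stand-in for a hash partitioner:
    # a stable hash of the key, modulo the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def spill(buffer, num_reducers):
    """Split buffered (key, value) pairs into partitions, then sort each by key."""
    partitions = defaultdict(list)
    for key, value in buffer:
        partitions[partition_for(key, num_reducers)].append((key, value))
    # Sort on the key only; Python's sort is stable, so records with
    # equal keys keep the order in which the mapper emitted them.
    return {
        p: sorted(records, key=lambda kv: kv[0])
        for p, records in sorted(partitions.items())
    }


# Assumed sample map output (word-count style records).
map_output = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("kiwi", 1)]
spilled = spill(map_output, num_reducers=2)

for partition, records in spilled.items():
    print(partition, records)
```

Every record with the same key lands in the same partition, and each partition is already sorted by key before it leaves the map side. That is why the reduce side only needs to merge.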

The Reduce Side:

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. The merge is done in rounds, combining a limited number of sorted inputs at a time.
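The merge in rounds can be sketched in Python as follows. This is a simplified model, not the framework's implementation: `merge_in_rounds` and the `factor` parameter are hypothetical names, and the real framework may size its rounds differently. The point is that already-sorted runs only need a merge, never a full re-sort.

```python
import heapq
from itertools import groupby


def merge_in_rounds(runs, factor):
    """Merge up to `factor` sorted runs at a time until one sorted run remains."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        # One round: merge each group of `factor` runs into a single run.
        runs = [
            list(heapq.merge(*runs[i:i + factor], key=lambda kv: kv[0]))
            for i in range(0, len(runs), factor)
        ]
    return runs[0] if runs else []


# Assumed sample input: three map outputs for one reducer, each sorted by key.
runs = [
    [("apple", 1), ("fig", 1)],
    [("apple", 1), ("pear", 1)],
    [("kiwi", 1)],
]
merged = merge_in_rounds(runs, factor=2)

# The reducer then sees each key once, with all of its values grouped together.
reduce_input = [
    (key, [value for _, value in group])
    for key, group in groupby(merged, key=lambda kv: kv[0])
]
print(reduce_input)
```

With `factor=2`, the first round merges the first two runs and carries the third over, and the second round merges the two results into the final sorted sequence.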

Source: Hadoop: The Definitive Guide.


answered Oct 26 '22 20:10

mrsrinivas
