Does the shuffle step in a MapReduce program run in parallel with Mapping?

I was trying to understand a MapReduce program. While doing that, I noticed that the reduce tasks start executing almost immediately after all the map tasks are finished. Now, this is surprising, because the reduce tasks work with data grouped by key, which means a shuffle/sort step happens in between. The only way this could happen is if the shuffling were being done in parallel with the mapping.

Secondly, if shuffling is indeed done in parallel with mapping, what is the equivalent of that in Apache Spark? Can mapping and grouping by keys and/or sorting happen in parallel there too?

asked Apr 04 '17 by pythonic

People also ask

What is the shuffle procedure in MapReduce?

The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input.

What happens in shuffling at MapReduce phase?

The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer. The sort phase covers the merging and sorting of map outputs. Data from the mapper are grouped by key, split among reducers, and sorted by key.
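As a rough sketch in Python of that partition-and-sort step (the keys, the number of reducers, and the use of Python's built-in hash below are illustrative; Hadoop's actual HashPartitioner and merge-sort machinery are more involved):

```python
from collections import defaultdict

# Hypothetical map output: (key, value) pairs emitted by mappers.
map_output = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

NUM_REDUCERS = 2

# Partition: each key is routed to one reducer by hashing it.
partitions = defaultdict(list)
for key, value in map_output:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Sort: within each reducer's partition, records are ordered by key,
# so all values for one key arrive at the reducer together.
for reducer_id, records in partitions.items():
    records.sort(key=lambda kv: kv[0])
    print(reducer_id, records)
```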

Is MapReduce parallel processing?

MapReduce is an attractive model for parallel data processing in high-performance cluster computing environments. Its scalability is proven to be high, because a job in the MapReduce model is partitioned into numerous small tasks running on multiple machines in a large-scale cluster.

How does MapReduce use parallel processing?

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits or shards. The input shards can be processed in parallel on different machines.
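To make the "parallel map over input splits" idea concrete, here is a small Python sketch; the word-count mapper and the hard-coded splits are made up for illustration, and Hadoop would be distributing HDFS blocks across machines rather than processes on one host:

```python
from concurrent.futures import ProcessPoolExecutor

def map_word_count(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]

# Hypothetical input already partitioned into M splits.
splits = ["the quick brown fox", "jumps over the lazy dog", "the end"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Each split is mapped independently, so all splits can run in parallel.
        results = list(pool.map(map_word_count, splits))
    print(results)
```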


1 Answer

Hadoop's MapReduce is not just map and reduce stages; there are additional steps such as combiners (map-side reduce) and merge, as illustrated in this diagram: http://www.bodhtree.com/blog/2012/10/18/ever-wondered-what-happens-between-map-and-reduce/

While maps are still running, as they emit keys, those keys can be routed and merged, so by the time a map finishes, all of the information needed for some reduce buckets may already be processed and ready for reduce.
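To illustrate the map-side reduce (combiner) step mentioned above, here is a minimal Python sketch, using word count as a hypothetical job; Hadoop invokes the combiner on each mapper's local output before the shuffle:

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for each word in one input line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Map-side reduce: locally sum counts per key, so far less
    data has to cross the network during the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

line = "to be or not to be"
print(mapper(line))            # six (word, 1) pairs
print(combiner(mapper(line)))  # pre-aggregated: [('to', 2), ('be', 2), ...]
```

Pre-aggregating like this is part of why much of a reduce bucket's input can already be processed while other maps are still running.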

Spark builds a DAG (directed acyclic graph) of the phases needed to process the data and groups them into stages wherever data needs to be shuffled between nodes. Unlike Hadoop, where the data is pushed during map, Spark's reducers pull data and thus only do so when they begin to run. On the other hand, Spark tries to run more in memory (vs. disk) and, by working with a DAG, handles iterative processing better.
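A minimal PySpark sketch of that stage split (assuming pyspark is installed and a local master; reduceByKey is the shuffle boundary that cuts the DAG into two stages):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")

words = sc.parallelize(["to", "be", "or", "not", "to", "be"])

# map runs in stage 1; reduceByKey introduces a shuffle, so the
# summing side runs in stage 2 and pulls its input partitions
# from the map stage's output once it starts.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# The lineage shows the ShuffledRDD boundary between the stages.
print(counts.toDebugString().decode("utf-8"))
print(counts.collect())

sc.stop()
```

Note that reduceByKey also performs map-side combining before the shuffle, which is Spark's analogue of Hadoop's combiner.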

Alexey Grishchenko has a good explanation of Spark Shuffle here (note that as of Spark 2 only sort shuffle exists)

answered Sep 24 '22 by Arnon Rotem-Gal-Oz