Does the shuffle step in a MapReduce program run in parallel with Mapping?

I was trying to understand a MapReduce program. While doing that, I noticed that the reduce tasks start executing almost immediately after all the map tasks are finished. Now, this is surprising, because the reduce tasks work with data grouped by key, which means a shuffle/sort step happens in between. The only way this could happen is if the shuffling were being done in parallel with the mapping.

Secondly, if shuffling is indeed done in parallel with mapping, what is the equivalent of that in Apache Spark? Can mapping and grouping by keys and/or sorting happen in parallel there too?

asked Apr 04 '17 by pythonic

People also ask

What is the shuffle procedure in MapReduce?

The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input.

What happens in shuffling at MapReduce phase?

The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer. The sort phase covers the merging and sorting of map outputs. Data from the mapper are grouped by key, split among reducers, and sorted by key.
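As a rough sketch in Python of that partition-and-sort step (the keys, the number of reducers, and the use of Python's built-in hash below are illustrative; Hadoop's actual HashPartitioner and merge-sort machinery are more involved):

```python
from collections import defaultdict

# Hypothetical map output: (key, value) pairs emitted by mappers.
map_output = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

NUM_REDUCERS = 2

# Partition: each key is routed to one reducer by hashing it.
partitions = defaultdict(list)
for key, value in map_output:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Sort: within each reducer's partition, records are ordered by key,
# so all values for one key arrive at the reducer together.
for reducer_id, records in partitions.items():
    records.sort(key=lambda kv: kv[0])
    print(reducer_id, records)
```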

Is MapReduce parallel processing?

MapReduce is an attractive model for parallel data processing in high-performance cluster computing environments. Its scalability is proven to be high, because a job in the MapReduce model is partitioned into numerous small tasks running on multiple machines in a large-scale cluster.

How does MapReduce use parallel processing?

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits or shards. The input shards can be processed in parallel on different machines.
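To make the "parallel map over input splits" idea concrete, here is a small Python sketch; the word-count mapper and the hard-coded splits are made up for illustration, and Hadoop would be distributing HDFS blocks across machines rather than processes on one host:

```python
from concurrent.futures import ProcessPoolExecutor

def map_word_count(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]

# Hypothetical input already partitioned into M splits.
splits = ["the quick brown fox", "jumps over the lazy dog", "the end"]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Each split is mapped independently, so all splits can run in parallel.
        results = list(pool.map(map_word_count, splits))
    print(results)
```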


1 Answer

Hadoop's MapReduce is not just map and reduce stages; there are additional steps such as combiners (map-side reduce) and merge, as illustrated in this diagram: http://www.bodhtree.com/blog/2012/10/18/ever-wondered-what-happens-between-map-and-reduce/

While maps are still running, as they emit keys, those keys can be routed and merged, so by the time a map finishes, all of the information needed for some reduce buckets may already be processed and ready for reduce.
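To illustrate the map-side reduce (combiner) step mentioned above, here is a minimal Python sketch, using word count as a hypothetical job; Hadoop invokes the combiner on each mapper's local output before the shuffle:

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for each word in one input line."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Map-side reduce: locally sum counts per key, so far less
    data has to cross the network during the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

line = "to be or not to be"
print(mapper(line))            # six (word, 1) pairs
print(combiner(mapper(line)))  # pre-aggregated: [('to', 2), ('be', 2), ...]
```

Pre-aggregating like this is part of why much of a reduce bucket's input can already be processed while other maps are still running.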

Spark builds a DAG (directed acyclic graph) of the phases needed to process the data and groups them into stages wherever data needs to be shuffled between nodes. Unlike Hadoop, where the data is pushed during map, Spark's reducers pull data and thus only do so when they begin to run. On the other hand, Spark tries to run more in memory (vs. disk) and, by working with a DAG, handles iterative processing better.
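A minimal PySpark sketch of that stage split (assuming pyspark is installed and a local master; reduceByKey is the shuffle boundary that cuts the DAG into two stages):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")

words = sc.parallelize(["to", "be", "or", "not", "to", "be"])

# map runs in stage 1; reduceByKey introduces a shuffle, so the
# summing side runs in stage 2 and pulls its input partitions
# from the map stage's output once it starts.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# The lineage shows the ShuffledRDD boundary between the stages.
print(counts.toDebugString().decode("utf-8"))
print(counts.collect())

sc.stop()
```

Note that reduceByKey also performs map-side combining before the shuffle, which is Spark's analogue of Hadoop's combiner.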

Alexey Grishchenko has a good explanation of Spark Shuffle here (note that as of Spark 2 only sort shuffle exists)

answered Sep 24 '22 by Arnon Rotem-Gal-Oz