 

'map-side' aggregation in Spark

Tags:

apache-spark

I am learning Spark using the book 'Learning Spark' and came across this sentence (page 54): "We can disable map-side aggregation in combineByKey() if we know that our data won't benefit from it." I am confused about what is meant by map-side aggregation here. The only thing that comes to mind is the Mapper and Reducer in Hadoop MapReduce, but I believe that is in no way related to Spark.

Asked Jul 08 '15 by Raj

People also ask

What is map side combine in Spark?

It is also known as a map-side join (associating worker nodes with mappers). Spark deploys this join strategy when the size of one of the join relations is below a threshold (10 MB by default). The Spark property which defines this threshold is spark.

What is aggregation in Spark?

Aggregations in Spark are similar to those in any relational database. Aggregations are a way to group data together to look at it from a higher level, as illustrated in figure 1. Aggregation can be performed on tables, joined tables, views, etc. (Figure 1: A look at the data before you perform an aggregation.)

Which transformation in Spark does not use combiner for aggregation?

For example, you cannot use map-side aggregation (a combiner) if you group values by key (the groupByKey operation does not use a combiner). The reason is that all values for each key must be present after the groupByKey operation is finished, so local reduction of values (a combiner) is not possible.

How do I use groupByKey in Spark?

In Spark, the groupByKey function is a frequently used transformation operation that performs shuffling of data. It receives key-value pairs (K, V) as input, groups the values by key, and generates a dataset of (K, Iterable[V]) pairs as output.
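A minimal sketch of the behaviour described above, assuming an existing SparkContext `sc` (as in spark-shell); the sample keys and values are made up for illustration:

```scala
// groupByKey shuffles every (K, V) pair across the network and
// collects all values for each key into an Iterable on the reduce side.
val grouped = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).groupByKey()
// grouped pairs each key ("a", "b") with an Iterable of its values;
// no local pre-aggregation happens before the shuffle.
```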


1 Answer

The idea behind map-side aggregation is pretty much the same as that behind Hadoop combiners: if a single mapper can yield multiple values for the same key, you can reduce shuffling by merging values locally first.

One example of an operation which can benefit from map-side aggregation is creating a set of values for each key, especially when you partition an RDD before combining:

First let's create some dummy data and partition it by key:

val pairs = sc.parallelize(
    ("foo", 1) :: ("foo", 1) :: ("foo", 2) ::
    ("bar", 3) :: ("bar", 4) :: ("bar", 5) :: Nil
)

// Pre-partition so that identical keys are co-located before combining.
val partitionedPairs = pairs.partitionBy(
    new org.apache.spark.HashPartitioner(2))

Then merge the data using combineByKey:

import collection.mutable.{Set => MSet}

val combined = partitionedPairs.combineByKey(
    (v: Int) => MSet[Int](v),                            // createCombiner: start a set from the first value
    (set: MSet[Int], v: Int) => set += v,                // mergeValue: add a value within a partition
    (set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2  // mergeCombiners: merge sets across partitions
)
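To inspect the result, a sketch converting the mutable sets to immutable ones for a stable comparison (assumes `combined` from the snippet above):

```scala
// Collect the per-key sets to the driver as a Map.
val result = combined.mapValues(_.toSet).collectAsMap()
// result("foo") == Set(1, 2)  (the duplicate 1 was deduplicated)
// result("bar") == Set(3, 4, 5)
```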

Depending on the data distribution, this can significantly reduce network traffic.

Overall,

  • reduceByKey
  • combineByKey with mapSideCombine set to true
  • aggregateByKey
  • foldByKey

will use map-side aggregation, while groupByKey and combineByKey with mapSideCombine set to false won't.
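The contrast can be sketched with a simple per-key sum over the `pairs` RDD defined above; both versions give the same answer, but only the first pre-aggregates before the shuffle:

```scala
// reduceByKey: partial sums are computed within each partition
// (map-side), so only one (key, partialSum) pair per key per
// partition crosses the network.
val sums = pairs.reduceByKey(_ + _)

// groupByKey: every individual (key, value) pair is shuffled,
// and the values are only summed after the shuffle.
val sumsNoCombine = pairs.groupByKey().mapValues(_.sum)
```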

However, the choice between applying map-side aggregation or not is not always obvious: the cost of maintaining the required data structures and the subsequent garbage collection can in many cases exceed the cost of the shuffle.
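When local aggregation would not pay off, combineByKey's overload that takes a partitioner and a mapSideCombine flag lets you turn the behaviour off explicitly. A sketch, reusing the set-building combine from above on the `pairs` RDD:

```scala
import org.apache.spark.HashPartitioner
import scala.collection.mutable.{Set => MSet}

// Same combine logic as before, but with mapSideCombine = false:
// values are shuffled raw and only merged into sets on the reduce side.
val combinedNoMapSide = pairs.combineByKey(
  (v: Int) => MSet[Int](v),
  (set: MSet[Int], v: Int) => set += v,
  (s1: MSet[Int], s2: MSet[Int]) => s1 ++= s2,
  new HashPartitioner(2),
  mapSideCombine = false
)
```

The final per-key sets are identical either way; only where the merging happens (and hence the shuffle volume and per-task memory pressure) changes.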

Answered Sep 28 '22 by zero323