Spark: aggregate versus map and reduce

I'm learning Spark and starting to understand how Spark distributes the data and combines the results. I came to the conclusion that using the operation map followed by reduce has an advantage over using just the operation aggregate. This is (at least I believe so) because aggregate uses a sequential operation, which hurts parallelism, while map and reduce can benefit from full parallelism. So when there is a choice, isn't it better to use map and reduce rather than aggregate? Are there cases where aggregate is preferred? Or cases where aggregate can't be replaced by the combination of map and reduce?

As an example - I want to find the string with the max length:

val z = sc.parallelize(List("123","12","345","4567"))
// instead of this aggregate ....
z.aggregate(0)((x, y) => math.max(x, y.length), (x, y) => math.max(x, y))
// .... shouldn't I rather use this map - reduce combination ?
z.map(_.length).reduce((x, y) => math.max(x, y))
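
For reference, I also tried forcing two partitions to see how aggregate splits the work; according to the docs, the first function (seqOp) is applied within each partition and the second one (combOp) merges the per-partition results. The numSlices value below is just for illustration:

// Force two partitions so the per-partition folding is visible.
val z2 = sc.parallelize(List("123", "12", "345", "4567"), numSlices = 2)
// glom shows what each partition holds, e.g. [123, 12] and [345, 4567].
z2.glom().collect().foreach(part => println(part.mkString("[", ", ", "]")))
// seqOp folds each partition's strings into a local max length,
// combOp then merges the per-partition maxima.
val maxLen = z2.aggregate(0)((acc, s) => math.max(acc, s.length),
                             (a, b) => math.max(a, b))
println(maxLen) // 4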
asked Sep 21 '18 by Sorin-Alexandru Cristescu

People also ask

What is the difference between MapReduce and Spark?

Apache Spark is popular for its speed. It runs up to 100 times faster in memory and ten times faster on disk than Hadoop MapReduce, since it processes data in memory (RAM). By contrast, Hadoop MapReduce has to persist data back to disk after every Map or Reduce step.

Why is Spark faster than MapReduce?

The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing speeds are up to 100x faster than MapReduce.

Does Spark use MapReduce?

Spark was inspired by the Hadoop MapReduce distributed computing model and integrates with the Hadoop ecosystem (for example HDFS and YARN), but it runs its own execution engine rather than MapReduce itself.

Why does MapReduce take longer to execute programs as compared to Spark?

MapReduce is designed for batch processing and is not as fast as Spark. It is used for gathering data from multiple sources, processing it once, and storing it in a distributed data store like HDFS. It is best suited to cases where memory is limited and the data is so large that it would not fit in the available memory.


1 Answer

A little example can be better than a long explanation.

Imagine you have a class Toto with an age field. You have many Totos and you want to compute the sum of the ages of all of them.

final case class Toto(age: Int)

val n = 1000000  // however many Totos you have
val rdd = sc.parallelize(0 until n).map(Toto(_))

// map/reduce style
val sum1 = rdd
             // O(n) operations to go through every Toto's age
             .map(_.age)
             // another O(n) to access the data, then O(n) operations to sum the n values
             .reduce(_ + _)
// You get the result with two passes over your data plus O(n) additions

// aggregate style
val sum2 = rdd.aggregate(0)((agg, e) => agg + e.age, _ + _)
// With one pass over the data and O(n) additions you obtain the same result

It's a bit more complicated if you take the accesses and the individual operations into account.

aggregate still has to access each element and then add its age into the accumulator, which represents O(2n) operations: O(n) accesses plus O(n) additions, plus a negligible merge between the per-partition aggregates.

On the other side, with the map/reduce style, the map represents O(n) accesses, then another O(n) accesses to the data to reduce them, with an overhead of O(n) addition operations, for a total of O(3n) operations.
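
If you want to check whether that difference actually shows up, a rough timing sketch like the one below can help. It assumes a spark-shell session where sc is already defined; the time helper and the 10-million-element RDD are just for illustration:

def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// ages 0-99 so the Int sum stays well below Int.MaxValue
val bigRdd = sc.parallelize(0 until 10000000).map(i => Toto(i % 100)).cache()
bigRdd.count()  // materialize the cache so both runs read the same in-memory data

val s1 = time("map + reduce") { bigRdd.map(_.age).reduce(_ + _) }
val s2 = time("aggregate")    { bigRdd.aggregate(0)((agg, e) => agg + e.age, _ + _) }
assert(s1 == s2)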

And don't forget that Spark is lazy: all of your transformations are only evaluated when a final action triggers them.
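
For instance, in the little snippet below nothing touches the data until reduce (an action) runs; the map only records a transformation in the lineage:

val lengths = sc.parallelize(List("123", "12", "345", "4567")).map(_.length)
// nothing has been computed yet: map is a lazy transformation
val longest = lengths.reduce((a, b) => math.max(a, b))
// reduce is an action, so only here does Spark run the map and the reduce
println(longest) // 4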

I presume that using aggregate will save some operations and therefore improve the application's running time. But depending on what you're doing, it can be more useful to express a map followed by a reduce for readability, compared to an aggregate or combineByKey (the generalization of aggregateByKey). So I would say it depends on which goals you want to reach in your use case.
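
As a sketch of that last point (the grouping key and values here are made up), the usual sum-and-count-per-key pattern can be written either with aggregateByKey or with the more general combineByKey:

val agesByGroup = sc.parallelize(Seq(("a", 10), ("a", 20), ("b", 30)))

// aggregateByKey: one zero value, a per-partition fold and a merge of accumulators
val avg1 = agesByGroup
  .aggregateByKey((0, 0))(
    (acc, age) => (acc._1 + age, acc._2 + 1),   // fold a value into (sum, count)
    (l, r)     => (l._1 + r._1, l._2 + r._2))   // merge two (sum, count) pairs
  .mapValues { case (sum, count) => sum.toDouble / count }

// combineByKey: same idea, but the initial accumulator is built from the first value
val avg2 = agesByGroup
  .combineByKey(
    (age: Int) => (age, 1),
    (acc: (Int, Int), age: Int) => (acc._1 + age, acc._2 + 1),
    (l: (Int, Int), r: (Int, Int)) => (l._1 + r._1, l._2 + r._2))
  .mapValues { case (sum, count) => sum.toDouble / count }

avg1.collect().foreach(println) // e.g. (a,15.0), (b,30.0)
avg2.collect().foreach(println) // same result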

answered Nov 15 '22 by KyBe