I'm learning Spark and starting to understand how it distributes data and combines the results. I came to the conclusion that using map followed by reduce has an advantage over using just aggregate. This is (at least I believe so) because aggregate uses a sequential operation, which hurts parallelism, while map and reduce can benefit from full parallelism. So when there is a choice, isn't it better to use map and reduce rather than aggregate? Are there cases where aggregate is preferred? Or are there cases where aggregate can't be replaced by a combination of map and reduce?
As an example - I want to find the string with the max length:
val z = sc.parallelize(List("123","12","345","4567"))
// instead of this aggregate ....
z.aggregate(0)((x, y) => math.max(x, y.length), (x, y) => math.max(x, y))
// .... shouldn't I rather use this map - reduce combination ?
z.map(_.length).reduce((x, y) => math.max(x, y))
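For reference, this is the signature of aggregate on RDD, where seqOp folds the elements of a partition into the accumulator and combOp merges the per-partition accumulators:
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U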
Apache Spark is popular above all for its speed: it can run up to 100 times faster in memory and around 10 times faster on disk than Hadoop MapReduce. The main reason is that Spark processes and retains data in memory (RAM) between steps, whereas Hadoop MapReduce has to persist data back to disk after every Map or Reduce step, so workloads whose data fits in memory see the largest speedups.
Spark grew out of the Hadoop MapReduce programming model and commonly runs on top of Hadoop infrastructure such as HDFS and YARN.
MapReduce is designed for batch processing and is not as fast as Spark. It is used for gathering data from multiple sources, processing it once, and storing the result in a distributed data store like HDFS. It is best suited to cases where memory is limited and the data to process is too large to fit in the available memory.
A small example can be better than a long explanation. Imagine you have a class Toto with an age field. You have many Toto instances and you want to compute the sum of their ages.
final case class Toto(age: Int)
val n = 1000000 // number of Toto instances (any value works)
val rdd = sc.parallelize(0 until n).map(Toto(_))
// map/reduce style
val sum1 = rdd
  // O(n) operations to go through every Toto's age
  .map(_.age)
  // another O(n) accesses to the data, then O(n) operations to sum the n values
  .reduce(_ + _)
// You get the result with two passes over your data plus O(n) additions
// aggregate style
val sum2 = rdd.aggregate(0)((agg, e) => agg + e.age, _ + _)
// With one pass over the data and O(n) additions you obtain the same result
It's a bit more complicated if you take into account accesses as well as operations.
Aggregate still has to access each element and then add its age to the accumulator, which represents O(2n) operations: O(n) accesses plus O(n) additions, plus a negligible merge step between per-partition aggregates.
On the other side, with the map/reduce style, the map represents O(n) accesses, then another O(n) accesses to the data to reduce them, with an overhead of O(n) addition operations, for a total of O(3n) operations.
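To picture why it is a single pass over the data, here is a rough sketch (my own illustration, not Spark's actual code) of what aggregate conceptually does: each partition is folded locally with the seqOp, in parallel, and only the small per-partition results are merged with the combOp.
// fold each partition locally, in parallel across executors (plays the role of seqOp)
val perPartitionSums = rdd.mapPartitions { totos =>
  Iterator(totos.foldLeft(0)((acc, t) => acc + t.age))
}
// merge the tiny per-partition results (plays the role of combOp)
val sum3 = perPartitionSums.reduce(_ + _)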
Without forgetting the fact that Spark is lazy, so all of your transformations are only actually executed when a final action is triggered.
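For example, nothing below touches the data until reduce is called; map only records the transformation:
val ages = rdd.map(_.age)      // lazy: returns immediately, no job runs yet
val total = ages.reduce(_ + _) // the action triggers the actual computation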
I presume that using aggregate will save some operations and therefore improve the application's running time. But depending on what you're doing, it can be more useful to express a map followed by a reduce for readability, compared to an aggregate or combineByKey (the generalization of aggregateByKey). So I suppose it depends on which goal you want to reach for a given use case.
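As a small sketch of that keyed generalization, assuming a hypothetical dataset of (city, Toto) pairs, aggregateByKey applies the same seqOp/combOp pattern per key:
// hypothetical keyed example: sum of ages per city
val byCity = sc.parallelize(Seq(("Paris", Toto(30)), ("Lyon", Toto(25)), ("Paris", Toto(41))))
val agesByCity = byCity.aggregateByKey(0)(
  (agg, t) => agg + t.age, // seqOp: fold one Toto into the per-partition sum
  _ + _                    // combOp: merge per-partition sums for the same key
)
// agesByCity.collect() would yield Array((Paris,71), (Lyon,25)) (order may vary)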