 

Difference between combineByKey and aggregateByKey

I am very new to Apache Spark, so this question may not be well phrased, but I do not understand the difference between combineByKey and aggregateByKey, or when to use which operation.

asked Apr 19 '17 by Tejinder Singh Bedi

People also ask

What is the difference between reduceByKey and groupByKey?

groupByKey can cause out-of-disk problems because every (key, value) pair is sent over the network and collected on the reducing workers. With reduceByKey, by contrast, values are combined within each partition first, so only one output per key per partition is sent over the network, as shown in the sketch below.
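For illustration, a minimal Scala sketch of the two approaches (assuming a spark-shell where sc is already defined; the keys and counts are made up):

// Word-count style pairs; "a" and "b" are hypothetical keys.
val words = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

// reduceByKey combines values on each partition first (map-side combine),
// so only one partial sum per key per partition is shuffled.
val withReduce = words.reduceByKey(_ + _)

// groupByKey ships every (key, value) pair across the network,
// and the values are only summed on the reducing side.
val withGroup = words.groupByKey().mapValues(_.sum)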

What is combineByKey?

It is a generic function that combines the elements for each key using a custom set of aggregation functions. Internally, Spark's combineByKey combines the values of a pair RDD partition by partition by applying those aggregation functions.
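As a sketch (not from the original answer, and assuming sc is an existing SparkContext), a classic use is a per-key average, where the accumulator is a (sum, count) tuple:

val scores = sc.parallelize(Seq(("a", 10.0), ("a", 20.0), ("b", 30.0)))

val avgPerKey = scores.combineByKey(
  (v: Double) => (v, 1),                                             // createCombiner: first value seen for a key
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold a value into the accumulator
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge accumulators across partitions
).mapValues { case (sum, count) => sum / count }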

Why is reduceByKey better than groupByKey?

reduceByKey() works better with larger datasets than groupByKey(). With reduceByKey(), pairs on the same machine with the same key are combined (using the function passed to reduceByKey()) before the data is shuffled.

What is aggregateByKey in Spark?

aggregateByKey is one of the aggregation functions available in Spark (others are reduceByKey and groupByKey). It is the aggregation function that allows several kinds of aggregation (maximum, minimum, average, sum and count) to be computed at the same time.
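As a hedged sketch of that claim (again assuming a spark-shell with sc available; the data is invented), a tuple accumulator lets aggregateByKey track min, max, sum and count in a single pass:

val nums = sc.parallelize(Seq(("a", 3), ("a", 7), ("b", 5)))

// Accumulator layout: (min, max, sum, count)
val stats = nums.aggregateByKey((Int.MaxValue, Int.MinValue, 0, 0))(
  (acc, v) => (math.min(acc._1, v), math.max(acc._2, v), acc._3 + v, acc._4 + 1),    // seqOp: fold one value in
  (a, b)   => (math.min(a._1, b._1), math.max(a._2, b._2), a._3 + b._3, a._4 + b._4) // combOp: merge partition results
)
// The per-key average is then stats.mapValues { case (_, _, sum, count) => sum.toDouble / count }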


1 Answer

aggregateByKey takes an initial accumulator, a first lambda function to merge a value into an accumulator, and a second lambda function to merge two accumulators.

combineByKey is more general: it adds an initial lambda function that creates the first accumulator from the first value seen for a key.

Here is an example:

val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
                                ("prova", 2), ("ciao", 4),
                                ("prova", 3), ("ciao", 6)))

// aggregateByKey: zero value first, then (seqOp, combOp)
pairs.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),   // merge one value into the accumulator
  (aggr1, aggr2) => aggr1 ::: aggr2           // merge two accumulators
).collect().toMap

// combineByKey: createCombiner, mergeValue, mergeCombiners
pairs.combineByKey(
  (value) => List(value),                                  // create the initial accumulator from the first value
  (aggr: List[Any], value) => aggr ::: (value :: Nil),     // merge one value into the accumulator
  (aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2  // merge two accumulators
).collect().toMap
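Run in a spark-shell, both calls should return the same mapping, roughly Map(prova -> List(1, 2, 3), ciao -> List(2, 4, 6)), although the order of the elements inside each list may vary with how partitions are merged.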
answered Sep 20 '22 by freedev