I am very new to Apache Spark, so this question may be naive, but I do not understand the difference between combineByKey and aggregateByKey, or when to use which operation.
groupByKey can cause out-of-memory problems (or heavy disk spill), because every value is sent over the network and collected on the reducing workers; see the sketch below. With reduceByKey, by contrast, values are combined within each partition first, so only one output per key per partition is sent over the network.
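A minimal sketch of the difference, assuming an existing SparkContext sc and a made-up words dataset:

val words = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// groupByKey ships every (key, value) pair across the network
// and only then sums on the reduce side:
val countsGrouped = words.groupByKey().mapValues(_.sum)

// reduceByKey first sums within each partition (map-side combine),
// so only one partial sum per key per partition is shuffled:
val countsReduced = words.reduceByKey(_ + _)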
combineByKey is a generic function for combining the elements of each key using a custom set of aggregation functions. Internally, Spark's combineByKey combines the values of a pair RDD partition by partition, applying the supplied aggregation functions.
reduceByKey() scales better than groupByKey() on large datasets: pairs with the same key on the same machine are combined (using the function passed to reduceByKey()) before the data is shuffled.
aggregateByKey is one of the aggregation functions available in Spark (others are reduceByKey and groupByKey). Because its accumulator type can differ from the value type, it can compute several aggregations (maximum, minimum, average, sum, and count) at the same time, as the sketch below shows.
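For instance, here is a sketch (assuming an existing SparkContext sc and made-up data) that computes min, max, sum, and count per key in a single pass, from which the average also follows:

val nums = sc.parallelize(List(("a", 3), ("a", 7), ("b", 5)))

// The accumulator is (min, max, sum, count)
val stats = nums.aggregateByKey((Int.MaxValue, Int.MinValue, 0, 0))(
  // seqOp: fold one value into the accumulator within a partition
  (acc, v) => (math.min(acc._1, v), math.max(acc._2, v), acc._3 + v, acc._4 + 1),
  // combOp: merge accumulators coming from different partitions
  (a, b) => (math.min(a._1, b._1), math.max(a._2, b._2), a._3 + b._3, a._4 + b._4)
)

// The average derives from sum and count:
val avg = stats.mapValues { case (_, _, sum, cnt) => sum.toDouble / cnt }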
aggregateByKey
takes an initial accumulator value, a first lambda function to merge a value into an accumulator, and a second lambda function to merge two accumulators.
combineByKey
is more general: instead of an initial accumulator value, it takes an initial lambda function that creates the accumulator from the first value seen for a key.
Here is an example:
val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
                                ("prova", 2), ("ciao", 4),
                                ("prova", 3), ("ciao", 6)))

// aggregateByKey: the empty list is the initial accumulator value
pairs.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),  // merge a value into an accumulator
  (aggr1, aggr2) => aggr1 ::: aggr2          // merge two accumulators
).collect().toMap
// combineByKey: the first lambda creates the initial accumulator from a value
pairs.combineByKey(
  (value) => List(value),
  (aggr: List[Any], value) => aggr ::: (value :: Nil),     // merge a value into an accumulator
  (aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2  // merge two accumulators
).collect().toMap
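Both calls produce the same result, roughly Map(prova -> List(1, 2, 3), ciao -> List(2, 4, 6)); the order of values inside each list may vary with partitioning.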