
Partial aggregation vs. combiners: which one is faster?

There are notes about how Cascading/Scalding optimize map-side evaluation: they use so-called partial aggregation. Is it actually a better approach than combiners? Is there any performance comparison on common Hadoop tasks (word count, for example)? If so, will Hadoop support this in the future?

yura asked May 23 '26 04:05



1 Answer

In practice, there are more benefits from partial aggregation than from using combiners.

The cases where combiners are useful are limited. Also, combiners reduce the amount of data the tasks have to move, not the number of reduce steps; that is a subtle distinction, but it adds up to significant performance differences.

There is a much broader range of use cases for partial aggregation in large distributed workflows. Partial aggregation can also be used to reduce the number of job steps a workflow requires.
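To make the idea concrete, here is a minimal sketch of map-side partial aggregation in plain Java, with no Hadoop or Cascading dependency. It assumes the approach is a small bounded cache inside the map task that pre-sums counts and emits an entry only when it is evicted or when the task finishes; the class and method names (`PartialAggregationSketch`, `PartialCounter`, `reduce`) are made up for illustration and are not Cascading's API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartialAggregationSketch {

    /** Hypothetical map-side counter backed by a bounded, access-ordered (LRU) cache. */
    public static final class PartialCounter {
        private final int capacity;
        // accessOrder=true: iteration starts at the least recently used key.
        private final LinkedHashMap<String, Long> cache = new LinkedHashMap<>(16, 0.75f, true);
        private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();

        public PartialCounter(int capacity) {
            this.capacity = capacity;
        }

        /** Count one occurrence of key; evict and emit the LRU entry if the cache overflows. */
        public void add(String key) {
            cache.merge(key, 1L, Long::sum);
            if (cache.size() > capacity) {
                Iterator<Map.Entry<String, Long>> it = cache.entrySet().iterator();
                Map.Entry<String, Long> eldest = it.next();
                emitted.add(new AbstractMap.SimpleImmutableEntry<>(eldest.getKey(), eldest.getValue()));
                it.remove();
            }
        }

        /** End of the map task: emit whatever is still cached and return all partial records. */
        public List<Map.Entry<String, Long>> flush() {
            for (Map.Entry<String, Long> e : cache.entrySet()) {
                emitted.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
            cache.clear();
            return emitted;
        }
    }

    /** Reduce side: sum the partial counts per key (a key may arrive more than once). */
    public static Map<String, Long> reduce(List<Map.Entry<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> e : partials) {
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        PartialCounter counter = new PartialCounter(2); // tiny cache to force evictions
        for (String word : "a b a c a b".split(" ")) {
            counter.add(word);
        }
        List<Map.Entry<String, Long>> partials = counter.flush();
        System.out.println("records sent to reduce: " + partials.size() + " (input had 6)");
        System.out.println("totals: " + reduce(partials));
    }
}
```

Because the cache is bounded, the same key can be emitted more than once, so the reduce side still has to sum the partial results. The saving comes from fewer records leaving the map task, without a separate combiner pass over spilled and sorted map output.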

Examples are shown in https://github.com/Cascading/Impatient/wiki/Part-5, which uses the CountBy and SumBy partial aggregates. If you look back through that project's commit history on GitHub, the code previously used GroupBy and Count, which resulted in more reduces.

Paco answered May 25 '26 16:05



