Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can percentiles of a set of data be calculated in a map-reduce manner?

My understanding is to calculate percentiles, the data needs to be sorted. Would this be possible with a huge amount of data spread across multiple servers, without moving it around?

like image 282
marathon Avatar asked Oct 07 '22 15:10

marathon


2 Answers

While MapReduce as a paradigm does not looks suited for the problem, hadoop's implementation of MR - is.
Hadoop's implementation of map reduce is based on distributed sort - and it is what you need. Hadoop is doing sort by moving data between servers only once - not that bad.
I would suggest to look onto hadoop terasort implementaiton which illustrate the good (and probabbly the best) way to sort massive data with hadoop. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html

like image 89
David Gruzman Avatar answered Oct 13 '22 12:10

David Gruzman


I would first create a histogram, either on one machine or multiple machines. Once you have a count for each possible value of buckets of possible values you can combine these if needed. The gain for using a histogram is that it has O(1) insertion/sort time instead of O(log n) and uses O(M) space where M is the number of possible values or buckets instead of O(N) where N is the number of sample.

A histogram is naturally sorted so you can get a total count and find the percentiles by counting from either end.

like image 42
Peter Lawrey Avatar answered Oct 13 '22 12:10

Peter Lawrey