My understanding is to calculate percentiles, the data needs to be sorted. Would this be possible with a huge amount of data spread across multiple servers, without moving it around?
While MapReduce as a paradigm does not looks suited for the problem, hadoop's implementation of MR - is.
Hadoop's implementation of map reduce is based on distributed sort - and it is what you need. Hadoop is doing sort by moving data between servers only once - not that bad.
I would suggest to look onto hadoop terasort implementaiton which illustrate the good (and probabbly the best) way to sort massive data with hadoop. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
I would first create a histogram, either on one machine or multiple machines. Once you have a count for each possible value of buckets of possible values you can combine these if needed. The gain for using a histogram is that it has O(1) insertion/sort time instead of O(log n) and uses O(M) space where M is the number of possible values or buckets instead of O(N) where N is the number of sample.
A histogram is naturally sorted so you can get a total count and find the percentiles by counting from either end.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With