Can percentiles of a set of data be calculated in a map-reduce manner?

Question

My understanding is to calculate percentiles, the data needs to be sorted. Would this be possible with a huge amount of data spread across multiple servers, without moving it around?

David Gruzman · Accepted Answer

While MapReduce as a paradigm does not looks suited for the problem, hadoop's implementation of MR - is.
Hadoop's implementation of map reduce is based on distributed sort - and it is what you need. Hadoop is doing sort by moving data between servers only once - not that bad.
I would suggest to look onto hadoop terasort implementaiton which illustrate the good (and probabbly the best) way to sort massive data with hadoop. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html

Peter Lawrey · Answer

I would first create a histogram, either on one machine or multiple machines. Once you have a count for each possible value of buckets of possible values you can combine these if needed. The gain for using a histogram is that it has O(1) insertion/sort time instead of O(log n) and uses O(M) space where M is the number of possible values or buckets instead of O(N) where N is the number of sample.

A histogram is naturally sorted so you can get a total count and find the percentiles by counting from either end.

Can percentiles of a set of data be calculated in a map-reduce manner?

Tags:

java

statistics

percentile

mapreduce

marathon

2 Answers

David Gruzman

Peter Lawrey

Recent Activity

Donate For Us

Can percentiles of a set of data be calculated in a map-reduce manner?

Tags:

java

statistics

percentile

mapreduce

marathon

2 Answers

David Gruzman

Peter Lawrey

Related questions

Recent Activity

Donate For Us