Can someone explain the computation of the median/quantiles in MapReduce?
My understanding of Datafu's median is that the n mappers sort the data and send it to a single reducer, which is responsible for sorting all the data from the n mappers and finding the median (middle value). Is my understanding correct?
If so, does this approach scale for massive amounts of data? I can clearly see that single reducer struggling to do the final task. Thanks.
Average is sum / size. If the sum is something like sum = k1 + k2 + k3 + ..., you might divide by size after or during the summing. So the average is also k1 / size + k2 / size + k3 / size + ... So you first map each element of the list to a double and then sum up via the reduce function.
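To make that concrete, here is a minimal sketch in plain Python (not Pig/Datafu or Hadoop code; the list of values is made up) showing that mapping each element to element / size and then reducing with addition gives the same result as summing first and dividing once:

```python
from functools import reduce

values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]  # made-up example data
size = len(values)

# Option 1: reduce to a sum first, then divide once.
avg_sum_then_divide = reduce(lambda a, b: a + b, values) / size

# Option 2: map each element to element / size, then reduce with addition.
avg_divide_then_sum = reduce(lambda a, b: a + b,
                             map(lambda v: v / size, values))

print(avg_sum_then_divide, avg_divide_then_sum)  # both ~18.0
```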
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment. MapReduce consists of two distinct tasks: Map and Reduce. As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
To understand the workflow of MapReduce with an example, suppose Twitter data is the input and MapReduce performs actions like tokenize, filter, count, and aggregate counters. Tokenize: tokenizes the tweets into maps of tokens and writes them as key-value pairs.
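As a rough illustration of that tokenize step (a plain-Python sketch, not actual Hadoop code; the sample tweets are invented), the map step would emit one (token, 1) key-value pair per word:

```python
# Hypothetical sample input, standing in for lines of Twitter data.
tweets = [
    "mapreduce makes big data small",
    "big data needs mapreduce",
]

def tokenize(tweet):
    """Map step: split a tweet into tokens and emit (token, 1) pairs."""
    for token in tweet.lower().split():
        yield (token, 1)

pairs = [pair for tweet in tweets for pair in tokenize(tweet)]
print(pairs[:4])  # [('mapreduce', 1), ('makes', 1), ('big', 1), ('data', 1)]
```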
MapReduce implements various algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps send the Map and Reduce tasks to appropriate servers in a cluster. These algorithms may include sorting and searching, among others.
Trying to find the median (middle number) in a series is going to require that a single reducer is passed the entire range of numbers to determine which is the 'middle' value.
Depending on the range and uniqueness of values in your input set, you could introduce a combiner to output the frequency of each value, reducing the number of map outputs sent to your single reducer. Your reducer can then consume the sorted value/frequency pairs to identify the median.
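To illustrate, here is a sketch in plain Python (not Hadoop code; the frequency dictionary stands in for what the combiners would emit) of how the single reducer can walk the sorted value/frequency pairs until the cumulative count reaches the middle position:

```python
def median_from_frequencies(value_counts):
    """Compute the median from {value: frequency} pairs (e.g. combiner output)."""
    total = sum(value_counts.values())
    lower_idx = (total - 1) // 2   # 0-based rank of the lower middle element
    upper_idx = total // 2         # 0-based rank of the upper middle element
    lower = upper = None

    cumulative = 0
    for value in sorted(value_counts):
        cumulative += value_counts[value]
        if lower is None and cumulative > lower_idx:
            lower = value
        if upper is None and cumulative > upper_idx:
            upper = value
            break
    return (lower + upper) / 2

# Raw data 1, 1, 2, 3, 3, 3, 9 collapses to the frequencies below; the median is 3.
print(median_from_frequencies({1: 2, 2: 1, 3: 3, 9: 1}))  # -> 3.0
```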
Another way you could scale this (again, if you know the range and rough distribution of values) is to use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 1, and so on). This will however require a secondary job to examine the reducer outputs and perform the final median calculation (knowing, for example, the number of keys in each reducer, you can calculate which reducer's output will contain the median, and at which offset).
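A rough sketch of that idea, again in plain Python rather than an actual Hadoop Partitioner (the bucket width, reducer count, and per-reducer counts are invented for the example): values are routed to reducers by range, and the secondary step uses the per-reducer output sizes to find which reducer's output contains the median and at what offset:

```python
NUM_REDUCERS = 10      # invented for the example
BUCKET_WIDTH = 100     # values 0-99 -> reducer 0, 100-199 -> reducer 1, ...

def partition(value):
    """Stand-in for a custom partitioner: route a value to a range bucket."""
    return min(value // BUCKET_WIDTH, NUM_REDUCERS - 1)

def locate_median(bucket_counts, total):
    """Secondary step: given how many values each reducer received, find which
    reducer's (sorted) output holds the median, and at what offset within it."""
    target = (total - 1) // 2  # 0-based rank of the (lower) median
    seen = 0
    for bucket in sorted(bucket_counts):
        if seen + bucket_counts[bucket] > target:
            return bucket, target - seen
        seen += bucket_counts[bucket]
    raise ValueError("bucket counts do not cover the target rank")

# Invented per-reducer counts for 1000 values in total.
counts = {0: 400, 1: 250, 2: 350}
print(locate_median(counts, 1000))  # -> (1, 99): the 100th value of reducer 1's output
```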
Do you really need the exact median and quantiles?
A lot of the time, you are better off just getting approximate values and working with them, in particular if you use them for e.g. data partitioning.
In fact, you can use the approximate quantiles to speed up finding the exact quantiles (actually in O(n/p) time). Here is a rough outline of the strategy:

1. Have each partition compute its local quantiles and output them to a new, much smaller data set.
2. From this small data set, compute combined quantile estimates, similar to a "median of medians". These are your initial estimates.
3. Repartition the full data according to these estimates, so that each desired quantile is guaranteed to end up in exactly one partition.
4. Within each partition, run a selection step (in O(n)) to find the true quantile.

Each of the steps is in linear time. The most costly step is part 3, as it requires the whole data set to be redistributed, so it generates O(n) network traffic.
You can probably optimize the process by choosing "alternate" quantiles for the first iteration. Say you want to find the global median. You can't find it easily in a single linear process, but you can probably narrow it down to 1/kth of the data set when it is split into k partitions. So instead of having each node report only its median, have each node additionally report the objects at (k-1)/(2k) and (k+1)/(2k). This should allow you to narrow down the range of values where the true median must lie significantly. Then, in the next step, each node can send the objects that are within the desired range to a single master node, which chooses the median within this range only.
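Here is a small simulation of that narrowing step in plain Python (the partition sizes and random data are made up; the quantile positions (k-1)/(2k) and (k+1)/(2k) come from the text above). Each node reports two local quantiles, the coordinator derives a range that must contain the global median, and only the values inside that range plus one count are sent to a single node for the exact answer:

```python
import random

def local_report(data, k):
    """Each node sorts its local data and reports the values at roughly the
    (k-1)/(2k) and (k+1)/(2k) quantile positions (as suggested above)."""
    s = sorted(data)
    n = len(s)
    lo_idx = max(0, n * (k - 1) // (2 * k) - 1)
    hi_idx = min(n - 1, n * (k + 1) // (2 * k))
    return s[lo_idx], s[hi_idx]

def distributed_median(partitions):
    k = len(partitions)
    reports = [local_report(p, k) for p in partitions]
    # Every partition has at least (k+1)/(2k) of its data <= its upper report,
    # so more than half of all data is <= the max of the upper reports; the
    # symmetric argument holds for the lower reports. Hence the global median
    # must lie within [lo, hi].
    lo = min(r[0] for r in reports)
    hi = max(r[1] for r in reports)

    # Each node sends only its values inside [lo, hi] to a single master node,
    # plus the count of its values strictly below lo.
    below = sum(sum(1 for v in p if v < lo) for p in partitions)
    candidates = sorted(v for p in partitions for v in p if lo <= v <= hi)

    total = sum(len(p) for p in partitions)
    target = (total - 1) // 2            # 0-based rank of the (lower) median
    return candidates[target - below]    # exact median, found within the range only

# Simulate k = 4 nodes with made-up data and check against a full sort.
random.seed(0)
parts = [[random.randint(0, 10_000) for _ in range(1000)] for _ in range(4)]
exact = sorted(v for p in parts for v in p)[(4000 - 1) // 2]
print(distributed_median(parts), exact)  # the two values agree
```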