Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interview Question: Find Median From Mega Number Of Integers

Tags:

algorithm

There is a file that contains 10G(1000000000) number of integers, please find the Median of these integers. you are given 2G memory to do this. Can anyone come up with an reasonable way? thanks!

like image 813
didxga Avatar asked Aug 26 '10 06:08

didxga


People also ask

How would you go about calculating a median from a dataset too large to store in memory?

Make one pass over the data calculating an average (arithmetic mean. Call that number the provisional median. Then make a second pass over the data. If the data point is less than the provisional median, reduce the provisional median by one.

What if you have a large set of data how do we find the median?

Tip: For large data sets, divide the number of items by 2, then subtract 1 to find the number that should be above and the number that should be below. For example, 100/2 = 50. 50 – 1 = 49. The middle two numbers will have 49 items above and 49 below.

Which one is the best suited data structure to calculate running median?

The idea is to use max heap and min-heap data structure to store the elements of the lower half and higher half respectively. We use a max heap to represent elements that are less than effective median, and a min heap to represent elements that are greater than effective median.


2 Answers

You can use the Medians of Medians algorithm.

like image 35
starblue Avatar answered Sep 18 '22 17:09

starblue


Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.

Now you count up in that histogram until you reach the bin that covers the midpoint of the values.

Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.

Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.

Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).

Here's some sample Scala code that does this:

def medianFinder(numbers: Iterable[Int]) = {   def midArgMid(a: Array[Long], mid: Long) = {     val cuml = a.scanLeft(0L)(_ + _).drop(1)     cuml.zipWithIndex.dropWhile(_._1 < mid).head   }   val topHistogram = new Array[Long](65536)   var count = 0L   numbers.foreach(number => {     count += 1     topHistogram(number>>>16) += 1   })   val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)   val botHistogram = new Array[Long](65536)   numbers.foreach(number => {     if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1   })   val (botCount,botIndex) =     midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))   (topIndex<<16) + botIndex } 

and here it is working on a small set of input data:

scala> medianFinder(List(1,123,12345,1234567,123456789)) res18: Int = 12345 

If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.

like image 89
Rex Kerr Avatar answered Sep 18 '22 17:09

Rex Kerr