There is a file that contains 10G(1000000000) number of integers, please find the Median of these integers. you are given 2G memory to do this. Can anyone come up with an reasonable way? thanks!
Make one pass over the data calculating an average (arithmetic mean. Call that number the provisional median. Then make a second pass over the data. If the data point is less than the provisional median, reduce the provisional median by one.
Tip: For large data sets, divide the number of items by 2, then subtract 1 to find the number that should be above and the number that should be below. For example, 100/2 = 50. 50 – 1 = 49. The middle two numbers will have 49 items above and 49 below.
The idea is to use max heap and min-heap data structure to store the elements of the lower half and higher half respectively. We use a max heap to represent elements that are less than effective median, and a min heap to represent elements that are greater than effective median.
You can use the Medians of Medians algorithm.
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n)
time and O(1)
space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = { def midArgMid(a: Array[Long], mid: Long) = { val cuml = a.scanLeft(0L)(_ + _).drop(1) cuml.zipWithIndex.dropWhile(_._1 < mid).head } val topHistogram = new Array[Long](65536) var count = 0L numbers.foreach(number => { count += 1 topHistogram(number>>>16) += 1 }) val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2) val botHistogram = new Array[Long](65536) numbers.foreach(number => { if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1 }) val (botCount,botIndex) = midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex))) (topIndex<<16) + botIndex }
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789)) res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With