Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retrieving the top 100 numbers from one hundred million of numbers

Tags:

algorithm

One of my friend has been asked with a question

Retrieving the max top 100 numbers from one hundred million of numbers

in a recent job interview. Do you have any idea to come up with an efficient way to solve it?

like image 531
didxga Avatar asked Mar 31 '10 05:03

didxga


People also ask

How do you find the 100 largest numbers out of an array of 1 billion numbers?

Use Heap building the MAX heap will take O(N),then the top 100 max numbers will be in the top of the Heap, all you need is to get them out from the heap(100 X O(Log(N)).

How to write one hundred million in numbers?

100,000,000 (one hundred million) is the natural number following 99,999,999 and preceding 100,000,001.


2 Answers

Run them all through a min-heap of size 100: for each input number k, replace the current min m with max(k, m). Afterwards the heap holds the 100 largest inputs.

A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.

Edit: I fail the interview -- I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest():

import heapq  def funnel(n, numbers):     if n == 0: return []     heap = numbers[:n]     heapq.heapify(heap)     for k in numbers[n:]:         if heap[0] < k:             heapq.heapreplace(heap, k)     return heap  >>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8]) [5, 8, 6, 9] 
like image 101
Darius Bacon Avatar answered Sep 26 '22 00:09

Darius Bacon


Ok, here is a really stupid answer, but it is a valid one:

  • Load all 100 million entries into an array
  • Call some quick sort implementation on it
  • Take last 100 items (it sorts ascending), or first 100 if you can sort descending.

Reasoning:

  • There is no context on the question, so efficiency can be argued - what IS efficient? Computer time or programmer time?
  • This method is implementable very fast.
  • 100 million entries - numbers, are just a couple of hundred mb, so every decent workstaiton can simply run that.

It is an ok solution for some sort of one time operation. It would suck running it x times per second or something. But then, we need more context - as mclientk also had with his simple SQL statement - assuming 100 million numbersdo not exist in memory is a feasible question, because... they may come from a database and most of the time will, when talking about business relevant numbers.

As such, the question is really hard to answer - efficiency first has to be defined.

like image 20
TomTom Avatar answered Sep 26 '22 00:09

TomTom