A friend of mine was asked this question <blockquote> Retrieve the top 100 numbers from one hundred million numbers </blockquote> in a recent job interview. Do you have any ideas for an efficient way to solve it?


Retrieving the top 100 numbers from one hundred million numbers

2 Answers

Run them all through a min-heap of size 100: once the heap holds 100 numbers, for each further input number k, replace the current min m with max(k, m) and restore the heap order. Afterwards the heap holds the 100 largest inputs.

A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.

Edit: I fail the interview -- I got the details wrong twice (after having done this before, in production). Here's code to check it; it's almost the same as Python's standard heapq.nlargest():

    import heapq

    def funnel(n, numbers):
        if n == 0: return []
        heap = numbers[:n]
        heapq.heapify(heap)
        for k in numbers[n:]:
            if heap[0] < k:
                heapq.heapreplace(heap, k)
        return heap

    >>> funnel(4, [3,1,4,1,5,9,2,6,5,3,5,8])
    [5, 8, 6, 9]
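The same idea also works when the numbers arrive as a stream rather than as a list, which may matter if they do not all fit in memory. Here is a minimal sketch, assuming the input is any iterable; `top_n_stream` is a hypothetical name used only for illustration, and the result is sorted at the end purely to make it easy to compare with `heapq.nlargest()`:

```python
import heapq
from itertools import islice


def top_n_stream(n, numbers):
    """Return the n largest items of an iterable, in descending order."""
    if n <= 0:
        return []
    it = iter(numbers)
    # Seed the min-heap with the first n items (fewer if the input is short).
    heap = list(islice(it, n))
    heapq.heapify(heap)
    for k in it:
        # heap[0] is the smallest of the current top n; only a larger k belongs.
        if heap[0] < k:
            heapq.heapreplace(heap, k)
    return sorted(heap, reverse=True)


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
print(top_n_stream(4, data))                              # [9, 8, 6, 5]
print(top_n_stream(4, data) == heapq.nlargest(4, data))   # True
```

Each input costs at most one heap operation on a heap of n items, so the whole pass takes O(N log n) time and O(n) extra memory; with N = 100 million and n = 100, the heap never holds more than 100 numbers.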


answered Sep 26 '22 00:09

Darius Bacon

Ok, here is a really stupid answer, but it is a valid one:

Load all 100 million entries into an array
Call some quicksort implementation on it
Take last 100 items (it sorts ascending), or first 100 if you can sort descending.
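The three steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: Python's built-in `sorted` stands in for the quicksort call (it is a different sorting algorithm, but the approach is the same), `top_n_by_sorting` is a hypothetical name, and a small random sample replaces the 100 million inputs so the sketch runs quickly.

```python
import random


def top_n_by_sorting(n, numbers):
    """Sort everything ascending, then keep the last n items."""
    if n <= 0:
        return []
    ordered = sorted(numbers)  # O(N log N) time, copies the whole input
    return ordered[-n:]        # the n largest, still in ascending order


# Small stand-in for the 100 million inputs so the sketch runs quickly.
random.seed(0)
sample = [random.randrange(1_000_000) for _ in range(10_000)]
top = top_n_by_sorting(100, sample)
print(len(top), top[-1] == max(sample))
```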

Reasoning:

There is no context to the question, so efficiency can be argued - what IS efficient? Computer time or programmer time?
This method can be implemented very quickly.
100 million numbers are just a few hundred MB (about 400 MB as 32-bit integers), so any decent workstation can simply run that.
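The memory estimate can be checked with a little arithmetic. This is a rough sketch, assuming the numbers are stored as packed fixed-width integers (for example in a typed array); a plain Python list of int objects would need considerably more, since each element is a separate object.

```python
import struct

COUNT = 100_000_000  # one hundred million numbers

# Standard sizes: "<i" is a 4-byte integer, "<q" an 8-byte integer.
bytes_per_int32 = struct.calcsize("<i")
bytes_per_int64 = struct.calcsize("<q")

mb_int32 = COUNT * bytes_per_int32 / 10**6
mb_int64 = COUNT * bytes_per_int64 / 10**6
print(f"32-bit: {mb_int32:.0f} MB, 64-bit: {mb_int64:.0f} MB")
# 32-bit: 400 MB, 64-bit: 800 MB
```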

It is an OK solution for a one-time operation. It would be a poor choice if it had to run several times per second. But then we need more context, as mclientk's simple SQL statement also shows: asking whether the 100 million numbers are in memory at all is a fair question, because they may come from a database, and most of the time they will when we are talking about business-relevant numbers.
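If the numbers do live in a database, the selection can be pushed down to it. mclientk's actual statement is not shown here, so the following is only an assumption of what such a query might look like, run against an in-memory SQLite database; the table name `measurements` and column `value` are hypothetical.

```python
import random
import sqlite3

# Hypothetical schema: a single table "measurements" with one numeric column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value INTEGER)")

# Small stand-in data set so the sketch runs quickly.
random.seed(1)
rows = [(random.randrange(1_000_000),) for _ in range(10_000)]
conn.executemany("INSERT INTO measurements (value) VALUES (?)", rows)

# Let the database do the selection instead of loading every row.
top_100 = [
    value
    for (value,) in conn.execute(
        "SELECT value FROM measurements ORDER BY value DESC LIMIT 100"
    )
]
conn.close()
print(len(top_100), top_100[0])
```

How efficient this is depends on the database: with an index on the column the engine may read only the top entries, while without one it may have to scan every row.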

As such, the question is really hard to answer - efficiency first has to be defined.

answered Sep 26 '22 00:09

TomTom



Tags:

algorithm

didxga
