Fastest way to obtain the largest X numbers from a very large unsorted list?

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions), so sorting is a time-intensive portion of the program.

What's the best way of doing the sorting to get the top 100 scores?

The only two methods I can think of so far are: first, generating all the scores into a massive array, sorting it, and taking the top 100; or second, generating X scores at a time, sorting them, truncating to the top 100, then continuing to generate more scores, adding them to the truncated list, and sorting again.

Either way I do it, it still takes more time than I would like. Any ideas on how to do it in an even more efficient way? (I've never taken programming courses before; maybe those of you with comp sci degrees know about efficient algorithms to do this, at least that's what I'm hoping.)

Lastly, what's the sorting algorithm used by the standard sort() function in C++?

Thanks,

-Faken

Edit: Just for anyone who is curious...

I did a few time trials on the before and after and here are the results:

Old program (performs sorting after each outer loop iteration):

top 100 scores: 147 seconds
top  10 scores: 147 seconds
top   1 scores: 146 seconds
Sorting disabled: 55 seconds

New program (tracking only the top scores and using the default sorting function):

top 100 scores: 350 seconds <-- hmm...worse than before
top  10 scores: 103 seconds 
top   1 scores:  69 seconds 
Sorting disabled: 51 seconds

New rewrite (optimizations in the data stored, hand-written sorting algorithm):

top 100 scores: 71 seconds <-- Very nice!
top  10 scores: 52 seconds
top   1 scores: 51 seconds
Sorting disabled: 50 seconds

Done on a Core 2, 1.6 GHz... I can't wait until my Core i7 860 arrives...

There are a lot of other, even more aggressive optimizations for me to work out (mainly in reducing the number of iterations I run), but as it stands right now the speed is more than good enough; I might not even bother to work out those algorithmic optimizations.

Thanks to everyone for their input!

asked by Faken, Oct 21 '09

4 Answers

  1. Take the first 100 scores and sort them in an array.
  2. Take the next score and insertion-sort it into the array (starting at the "small" end).
  3. Drop the 101st value.
  4. Continue with the next value, at step 2, until done.

Over time, the list will resemble the 100 largest values more and more closely, so more and more often you'll find that the insertion sort aborts immediately, because the new value is smaller than the smallest of the current candidates for the top 100.
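A minimal C++ sketch of these steps, assuming the scores fit in an `int` (the `TopK` name and the fixed-size sorted vector are illustrative choices, not from the original program):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Maintains the k largest values seen so far in a sorted (ascending) vector.
struct TopK {
    std::vector<int> top;   // sorted ascending; at most k elements
    std::size_t k;

    explicit TopK(std::size_t k) : k(k) { top.reserve(k + 1); }

    void insert(int score) {
        // Fast path: list is full and the new score doesn't beat the
        // current smallest candidate, so the insertion aborts at once.
        if (top.size() == k && score <= top.front())
            return;
        // Insertion-sort the score into place, then drop the smallest
        // element if we now exceed k.
        auto pos = std::lower_bound(top.begin(), top.end(), score);
        top.insert(pos, score);
        if (top.size() > k)
            top.erase(top.begin());
    }
};
```

After streaming all scores through `insert`, `top` holds the k largest in ascending order; as the answer notes, most calls take the fast path once the list stabilizes.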

answered by Martin v. Löwis, Oct 25 '22


You can do this in O(n) time, without any sorting, using a heap:

#!/usr/bin/python

import heapq

def top_n(l, n):
    top_n = []          # min-heap of the n largest elements seen so far

    smallest = None     # cached minimum of the heap

    for elem in l:
        if len(top_n) < n:
            top_n.append(elem)
            if len(top_n) == n:
                # Heap is now full: heapify once; the minimum sits at index 0.
                heapq.heapify(top_n)
                smallest = top_n[0]
        else:
            if elem > smallest:
                # Replace the smallest candidate with the new element.
                heapq.heapreplace(top_n, elem)
                smallest = top_n[0]

    return sorted(top_n)


def random_ints(n):
    import random
    for i in range(0, n):
        yield random.randint(0, 10000)

print top_n(random_ints(1000000), 100)

Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):

  • 100000 elements: .29 seconds
  • 1000000 elements: 2.8 seconds
  • 10000000 elements: 25.2 seconds

Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
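That translation might look roughly like the following sketch (the `top_n` function name and `int` scores are illustrative assumptions, not from the original answer):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Return the n largest values of 'scores' in ascending order, using a
// min-heap: std::priority_queue with std::greater keeps the smallest
// current candidate at top().
std::vector<int> top_n(const std::vector<int>& scores, std::size_t n) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int s : scores) {
        if (heap.size() < n) {
            heap.push(s);
        } else if (s > heap.top()) {
            // No heapreplace equivalent: pop the smallest, push the new value.
            heap.pop();
            heap.push(s);
        }
    }
    std::vector<int> result;
    while (!heap.empty()) {
        result.push_back(heap.top());  // smallest remaining comes out first
        heap.pop();
    }
    return result;  // ascending, matching the Python version's sorted() output
}
```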

answered by Jack Lloyd, Oct 25 '22


Here's the 'natural' C++ way to do this:

std::vector<Score> v;
// ... fill v with all the scores ...

// Move the 100 largest scores to the front of v, in descending order:
std::partial_sort(v.begin(), v.begin() + 100, v.end(), std::greater<Score>());
// Re-sort just those 100 into ascending order, if that's the order you want:
std::sort(v.begin(), v.begin() + 100);

This is effectively linear in the number of scores: std::partial_sort is O(n log k), and here k is fixed at 100.
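As a side note, not part of the answer above: if the order of the winners doesn't matter until the end, std::nth_element does the selection in expected O(n), and only the 100-element prefix then needs sorting. A sketch (function name and `int` scores are illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Partition v so its 100 largest values occupy v[0..99] (unordered)
// in expected O(n) time, then sort just that prefix, descending.
void top100_nth_element(std::vector<int>& v) {
    if (v.size() <= 100) {
        std::sort(v.begin(), v.end(), std::greater<int>());
        return;
    }
    std::nth_element(v.begin(), v.begin() + 100, v.end(), std::greater<int>());
    std::sort(v.begin(), v.begin() + 100, std::greater<int>());
}
```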

The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses an "adaptive introsort", which is essentially a median-of-3 quicksort down to a certain level, followed by an insertion sort.

answered by Richard Smith, Oct 25 '22


Declare an array where you can put the 100 best scores. Loop through the huge list and check, for each item, whether it qualifies to be inserted in the top 100. Use a simple insertion sort to add a qualifying item to the top list.

Something like this (C# code, but you get the idea):

Score[] toplist = new Score[100];  // kept in descending order
int size = 0;                      // number of valid entries in toplist
foreach (Score score in hugeList) {
   // Walk backwards from the end, shifting smaller entries down one slot.
   int pos = size;
   while (pos > 0 && toplist[pos - 1] < score) {
      pos--;
      if (pos < 99) toplist[pos + 1] = toplist[pos];
   }
   if (size < 100) size++;
   if (pos < size) toplist[pos] = score;  // insert if it made the cut
}

I tested it on my computer (Core 2 Duo 2.54 GHz, Win 7 x64) and I can process 100,000,000 items in 369 ms.

answered by Guffa, Oct 25 '22