Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting until we have the lowest half of the sorted array

I am currently trying to get the values that are in the lowest half of an array of data. This array in unsorted at first.

From this:

{4,6,9,3,8,5}

To this :

{3,4,5,6,9,8} or {3,4,5}

An easy solution would be to sort the array (using quicksort) and then use only the values stored in the first half of the sorted array. However, since quicksort and most efficient sorting algorithms will sort the entire array while I only need the first 50%, this seems like a waste of ressources. Note that performance is an issue in this project.

Knowing that a full sort is O(n log n) and that a sort that stops after it finds the lowest element is O(n), I can easily build a simple algorithm that will have a complexity of n/2 * n to find the lowest 50%. But is that really better than a full quicksort?

To be clear, what would be the best sort to use if we want only the lowest half of the values in an array? If the 50% was smaller (like 1%), a sequential search of the lowest elements would of course be the fastest solution, but at what % does it become slower than a quicksort?

I am coding in C++ and using vectors, but this question should be pretty general.

like image 596
Alex Millette Avatar asked Dec 06 '22 13:12

Alex Millette


2 Answers

#include <algorithm>
std::partial_sort(start, middle, end);
like image 135
Mike Seymour Avatar answered Dec 09 '22 14:12

Mike Seymour


If you don't need the lower half sorted, use std::nth_element. If you need the lower half sorted and the vector contains fewer than 100,000 elements, use std::partial_sort, if your vector is larger then use std::nth_element to partition the vector into lower half and upper half, then use std::qsort on the lower half. I've confirmed this on an Intel Xeon X5570 @ 2.93GHz running CentOS with g++ 4.4.3 and give timings at the end of this answer. Scott Meyers and others have found it astonishing that std::nth_element followed by std::qsort can be that much faster than std::partial_sort for large vectors:

http://www.velocityreviews.com/forums/t745258-nth_element-sort-versus-partial_sort.html

If you just want the lowest half of the values and don't need those to be sorted then std::nth_element is fastest (complexity is linear).

http://www.cplusplus.com/reference/algorithm/nth_element/

// nth_element example (modified to partition into lower/upper halves)
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;

int main () {
    vector<int> myvector;
    vector<int>::iterator it;

    // set some values:
    for (int i=1; i<10; i++) myvector.push_back(i);   // 1 2 3 4 5 6 7 8 9

    random_shuffle (myvector.begin(), myvector.end());

    // using default comparison (operator <):
    nth_element (myvector.begin(), myvector.begin()+myvector.size()/2, myvector.end());

    // print out content:
    cout << "myvector contains:";
    for (it=myvector.begin(); it!=myvector.end(); ++it)
        cout << " " << *it;

    cout << endl;

    return 0;
}

On an Intel Xeon X5570 @ 2.93GHz running CentOS and using g++ 4.4.3 I measure the following times. It is clear from the data that std::nth_element is linear and faster than std::partial_sort for all sizes, and 94 times faster when N is 1 billion elements.

N =       1000 nth_element   0.0000082 sec
N =       1000 nth + qsort   0.0001114 sec
N =       1000 partial_sort  0.0000438 sec

N =      10000 nth_element   0.0000592 sec
N =      10000 nth + qsort   0.0005639 sec
N =      10000 partial_sort  0.0005271 sec

N =     100000 nth_element   0.00095 sec
N =     100000 nth + qsort   0.00683 sec
N =     100000 partial_sort  0.00697 sec

N =    1000000 nth_element   0.0086 sec
N =    1000000 nth + qsort   0.0831 sec
N =    1000000 partial_sort  0.1227 sec

N =   10000000 nth_element   0.0700 sec
N =   10000000 nth + qsort   0.9307 sec
N =   10000000 partial_sort  2.7006 sec

N =  100000000 nth_element   0.8147 sec
N =  100000000 nth + qsort  10.7602 sec
N =  100000000 partial_sort 56.7105 sec

N = 1000000000 nth_element   10.055 sec
N = 1000000000 nth + qsort  123.703 sec
N = 1000000000 partial_sort 947.949 sec
like image 20
amdn Avatar answered Dec 09 '22 13:12

amdn