I have a set of double-precision data and I need their list to be always sorted. What is the best algorithm to sort the data as it is being added?
As best I mean least Big-O in data count, Small-O in data count (worst case scenario), and least Small-O in the space needed, in that order if possible.
The set size is really variable, from a small number (30) to lots of data (+10M).
Quicksort. Quicksort is one of the most efficient sorting algorithms, and this makes of it one of the most used as well. The first thing to do is to select a pivot number, this number will separate the data, on its left are the numbers smaller than it and the greater numbers on the right.
That's because the constant factor hidden in the big-Θ notation for quicksort is quite good. In practice, quicksort outperforms merge sort, and it significantly outperforms selection sort and insertion sort.
But since it has the upper hand in the average cases for most inputs, Quicksort is generally considered the “fastest” sorting algorithm.
Building a self-balancing binary tree like a red-black tree or AVL tree will allow for Θ(lg n) insertion and removal, and Θ(n) retrieval of all elements in sorted order (by doing a depth-first traversal), with Θ(n) memory usage. The implementation is somewhat complex, but they're efficient, and most languages will have library implementations, so they're a good first choice in most cases.
Additionally, retreiving the i-th element can be done by annotating each edge (or, equivalently, node) in the tree with the total number of nodes below it. Then one can find the i-th element in Θ(lg n) time and Θ(1) space with something like:
node *find_index(node *root, int i) {
while (node) {
if (i == root->left_count)
return root;
else if (i < root->left_count)
root = root->left;
else {
i -= root->left_count + 1;
root = root->right;
}
}
return NULL; // i > number of nodes
}
An implementation that supports this can be found in debian's libavl; unfortunately, the maintainer's site seems down, but it can be retrieved from debian's servers.
The structure that is used for indexes of database programs is a B+ Tree. It is a balanced bucketed n-ary tree.
From Wikipedia:
For a b-order B+ tree with h levels of index:
I use this in my program. You can add your data to the structure as it comes and you can always traverse it in order, front to back or back to front, or search quickly for any value. If you don't find the value, you will have the insertion point where you can add the value.
You can optimize the structure for your program by playing around with b, the size of the buckets.
An interesting presentation about B+ trees: Tree-Structured Indexes
You can get the entire code in C++.
Edit: Now I see your comment that your requirement to know the "i-th sorted element in the set" is an important one. All of a sudden, that makes many data structures less than optimal.
You are probably best off with a SortedList or even better, a SortedDictionary. See the article: Squeezing more performance from SortedList. Both structures have a GetKey function that will return the i-th element.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With