Is there any way to optimize sorting this kind of data?

I am sorting arrays of integer keys.

Information about the data:

  • Arrays are 1176 elements long
  • Keys are between 750,000 and 135,000,000; 0 is also possible
  • There are a lot of duplicates: every array contains only between 48 and 100 different keys, but it's impossible to predict which values out of the whole range those will be
  • There are a lot of long sorted subsequences; most arrays consist of anywhere between 33 and 80 sorted runs (counted as in the sketch after this list)
  • The smallest element is 0; the number of 0s is predictable and falls in a very narrow range, about 150 per array
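
By "sorted subsequences" I mean, roughly, maximal non-decreasing runs. A minimal sketch of how such a run count can be obtained (the function name count_runs is purely illustrative, not part of my code):

    #include <stddef.h>

    /* Count maximal non-decreasing runs in an array of n keys. */
    static size_t count_runs(const int *a, size_t n)
    {
        if (n == 0) return 0;
        size_t runs = 1;
        for (size_t i = 1; i < n; i++)
            if (a[i] < a[i - 1])   /* a descent ends the current run */
                runs++;
        return runs;
    }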

What I tried so far:

  1. stdlib.h qsort;

    this is slow; right now my function spends 0.6 s on sorting per execution, while with stdlib.h qsort it takes 1.0 s; std::sort has the same performance

  2. Timsort;

    I tried this: https://github.com/swenson/sort and this: http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17; both were significantly slower than stdlib qsort

  3. http://www.ucw.cz/libucw/ ;

    their combination of quicksort and insertion sort is the fastest for my data so far; I experimented with various settings, and using the middle element as pivot (not median of 3) and switching to insertion sort at 28-element sub-arrays (not 8 as default) gives the best performance (a rough sketch of this kind of hybrid follows the list below)

  4. shell sort;

    simple implementation with gaps from this article: http://en.wikipedia.org/wiki/Shellsort; it was decent, although slower than stdlib qsort
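
For concreteness, here is a minimal sketch of the kind of tuned quicksort/insertion-sort hybrid described in point 3. This is not libucw's actual code; the middle-element pivot and the cutoff constant 28 simply mirror the tuning mentioned above:

    #include <stddef.h>

    #define INSERTION_CUTOFF 28  /* cutoff that worked best in my experiments */

    /* Plain insertion sort; also used as the final pass over the nearly-sorted array. */
    static void insertion_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int key = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > key) {
                a[j] = a[j - 1];
                j--;
            }
            a[j] = key;
        }
    }

    /* Quicksort with the middle element as pivot; partitions at or below the
     * cutoff are left unsorted for the final insertion-sort pass. */
    static void quick_sort_rec(int *a, long lo, long hi)
    {
        while (hi - lo + 1 > INSERTION_CUTOFF) {
            int pivot = a[lo + (hi - lo) / 2];
            long i = lo, j = hi;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                    i++; j--;
                }
            }
            /* Recurse on the smaller half, iterate on the larger one. */
            if (j - lo < hi - i) {
                quick_sort_rec(a, lo, j);
                lo = i;
            } else {
                quick_sort_rec(a, i, hi);
                hi = j;
            }
        }
    }

    void hybrid_sort(int *a, size_t n)
    {
        if (n < 2) return;
        quick_sort_rec(a, 0, (long)n - 1);
        insertion_sort(a, n);  /* each element is within INSERTION_CUTOFF of its final spot */
    }

Finishing small partitions with one insertion-sort pass at the end is one common variant of this hybrid; libucw's implementation may organize it differently.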


My thoughts are that qsort does a lot of swapping around and ruins (i.e. reverses) sorted subsequences, so there should be some way to improve on it by exploiting the structure of the data; unfortunately, all my attempts have failed so far.
If you are curious what kind of data this is: these are sets of poker hands evaluated on various boards, already sorted on the previous board (which is where the sorted subsequences come from).

The function is in C. I use Visual Studio 2010. Any ideas?

Sample data: http://pastebin.com/kKUdnU3N
Sample full execution (1176 sorts): https://dl.dropbox.com/u/86311885/out.zip

asked Jun 19 '12 by Piotr Lopusiewicz



1 Answer

What if you first do a pass through the array to group the numbers and get rid of duplicates? Each number could go into a hash table where the number is the key and the number of times it appears is the value. So if the number 750 000 appears 57 times in the array, the hash table would hold key=750000; value=57. Then you can sort the much smaller list of distinct keys, which should be fewer than 100 elements long.

With this you only need to make one pass through the array, plus another pass through the much smaller hash table key list. This should avoid most of the swaps and comparisons.
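
A minimal sketch of this idea in C, assuming keys are non-negative ints and that each array holds at most a couple of hundred distinct keys (as the question states); the table size, hash constant, and the name count_and_sort are arbitrary illustrative choices, not part of the answer:

    #include <stddef.h>

    #define TABLE_SIZE 256   /* power of two, comfortably above the ~100 distinct keys per array */
    #define EMPTY_SLOT (-1)  /* keys are non-negative, so -1 can mark an empty slot */

    /* Sort a (n elements) by counting duplicates in a small open-addressing
     * hash table, sorting the distinct keys, and expanding the runs back out. */
    void count_and_sort(int *a, size_t n)
    {
        int keys[TABLE_SIZE], counts[TABLE_SIZE];
        for (size_t i = 0; i < TABLE_SIZE; i++) keys[i] = EMPTY_SLOT;

        /* Pass 1: count how many times each distinct key occurs. */
        for (size_t i = 0; i < n; i++) {
            unsigned h = ((unsigned)a[i] * 2654435761u) & (TABLE_SIZE - 1);
            while (keys[h] != EMPTY_SLOT && keys[h] != a[i])
                h = (h + 1) & (TABLE_SIZE - 1);   /* linear probing */
            if (keys[h] == EMPTY_SLOT) { keys[h] = a[i]; counts[h] = 0; }
            counts[h]++;
        }

        /* Collect the distinct (key, count) pairs. */
        int distinct[TABLE_SIZE], cnt[TABLE_SIZE];
        size_t m = 0;
        for (size_t i = 0; i < TABLE_SIZE; i++)
            if (keys[i] != EMPTY_SLOT) { distinct[m] = keys[i]; cnt[m] = counts[i]; m++; }

        /* Insertion-sort the small distinct-key list (at most ~100 entries). */
        for (size_t i = 1; i < m; i++) {
            int k = distinct[i], c = cnt[i];
            size_t j = i;
            while (j > 0 && distinct[j - 1] > k) {
                distinct[j] = distinct[j - 1];
                cnt[j] = cnt[j - 1];
                j--;
            }
            distinct[j] = k;
            cnt[j] = c;
        }

        /* Pass 2: expand the sorted pairs back into the original array. */
        size_t out = 0;
        for (size_t i = 0; i < m; i++)
            for (int c = 0; c < cnt[i]; c++)
                a[out++] = distinct[i];
    }

With only 48-100 distinct keys per array, the insertion sort touches a tiny list, so the two linear passes over the 1176-element array dominate the cost.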

answered Sep 30 '22 by Oleksi