Is there any way to optimize sorting this kind of data?

I am sorting arrays of integer keys.

Information about the data:

  • Arrays are 1176 elements long
  • Keys are between 750,000 and 135,000,000; 0 is also possible
  • There are a lot of duplicates: every array contains only between 48 and 100 different keys, but it's impossible to predict which values out of the whole range those will be
  • There are a lot of long sorted subsequences; most arrays consist of anywhere between 33 and 80 sorted runs (counted as in the sketch after this list)
  • The smallest element is 0; the number of 0s is predictable and falls in a very narrow range, about 150 per array
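
By "sorted subsequences" I mean, roughly, maximal non-decreasing runs. A minimal sketch of how such a run count can be obtained (the function name count_runs is purely illustrative, not part of my code):

    #include <stddef.h>

    /* Count maximal non-decreasing runs in an array of n keys. */
    static size_t count_runs(const int *a, size_t n)
    {
        if (n == 0) return 0;
        size_t runs = 1;
        for (size_t i = 1; i < n; i++)
            if (a[i] < a[i - 1])   /* a descent ends the current run */
                runs++;
        return runs;
    }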

What I tried so far:

  1. stdlib.h qsort;

    this is slow; right now my function spends 0.6 s on sorting per execution, while with stdlib.h qsort it takes 1.0 s; std::sort has the same performance

  2. Timsort;

    I tried this: https://github.com/swenson/sort and this: http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17; both were significantly slower than stdlib qsort

  3. http://www.ucw.cz/libucw/ ;

    their combination of quicksort and insertion sort is the fastest for my data so far; I experimented with various settings, and using the middle element as pivot (not median of 3) and switching to insertion sort at 28-element sub-arrays (not 8 as default) gives the best performance (a rough sketch of this kind of hybrid follows the list below)

  4. shell sort;

    simple implementation with gaps from this article: http://en.wikipedia.org/wiki/Shellsort; it was decent, although slower than stdlib qsort
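
For concreteness, here is a minimal sketch of the kind of tuned quicksort/insertion-sort hybrid described in point 3. This is not libucw's actual code; the middle-element pivot and the cutoff constant 28 simply mirror the tuning mentioned above:

    #include <stddef.h>

    #define INSERTION_CUTOFF 28  /* cutoff that worked best in my experiments */

    /* Plain insertion sort; also used as the final pass over the nearly-sorted array. */
    static void insertion_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int key = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > key) {
                a[j] = a[j - 1];
                j--;
            }
            a[j] = key;
        }
    }

    /* Quicksort with the middle element as pivot; partitions at or below the
     * cutoff are left unsorted for the final insertion-sort pass. */
    static void quick_sort_rec(int *a, long lo, long hi)
    {
        while (hi - lo + 1 > INSERTION_CUTOFF) {
            int pivot = a[lo + (hi - lo) / 2];
            long i = lo, j = hi;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                    i++; j--;
                }
            }
            /* Recurse on the smaller half, iterate on the larger one. */
            if (j - lo < hi - i) {
                quick_sort_rec(a, lo, j);
                lo = i;
            } else {
                quick_sort_rec(a, i, hi);
                hi = j;
            }
        }
    }

    void hybrid_sort(int *a, size_t n)
    {
        if (n < 2) return;
        quick_sort_rec(a, 0, (long)n - 1);
        insertion_sort(a, n);  /* each element is within INSERTION_CUTOFF of its final spot */
    }

Finishing small partitions with one insertion-sort pass at the end is one common variant of this hybrid; libucw's implementation may organize it differently.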


My thoughts are that qsort does a lot of swapping around and ruins (i.e. reverses) sorted subsequences, so there should be some way to improve on it by exploiting the structure of the data; unfortunately, all my attempts have failed so far.
If you are curious what kind of data this is: these are sets of poker hands evaluated on various boards, already sorted on the previous board (which is where the sorted subsequences come from).

The function is in C. I use Visual Studio 2010. Any ideas?

Sample data: http://pastebin.com/kKUdnU3N
Sample full execution (1176 sorts): https://dl.dropbox.com/u/86311885/out.zip

asked Jun 19 '12 by Piotr Lopusiewicz



1 Answer

What if you first do a pass through the array to group the numbers and get rid of duplicates? Each number could go into a hash table where the number is the key and the number of times it appears is the value. So if the number 750 000 appears 57 times in the array, the hash table would hold key=750000; value=57. Then you can sort the much smaller list of distinct keys, which should be fewer than 100 elements long.

With this you only need to make one pass through the array, plus another pass through the much smaller hash table key list. This should avoid most of the swaps and comparisons.
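
A minimal sketch of this idea in C, assuming keys are non-negative ints and that each array holds at most a couple of hundred distinct keys (as the question states); the table size, hash constant, and the name count_and_sort are arbitrary illustrative choices, not part of the answer:

    #include <stddef.h>

    #define TABLE_SIZE 256   /* power of two, comfortably above the ~100 distinct keys per array */
    #define EMPTY_SLOT (-1)  /* keys are non-negative, so -1 can mark an empty slot */

    /* Sort a (n elements) by counting duplicates in a small open-addressing
     * hash table, sorting the distinct keys, and expanding the runs back out. */
    void count_and_sort(int *a, size_t n)
    {
        int keys[TABLE_SIZE], counts[TABLE_SIZE];
        for (size_t i = 0; i < TABLE_SIZE; i++) keys[i] = EMPTY_SLOT;

        /* Pass 1: count how many times each distinct key occurs. */
        for (size_t i = 0; i < n; i++) {
            unsigned h = ((unsigned)a[i] * 2654435761u) & (TABLE_SIZE - 1);
            while (keys[h] != EMPTY_SLOT && keys[h] != a[i])
                h = (h + 1) & (TABLE_SIZE - 1);   /* linear probing */
            if (keys[h] == EMPTY_SLOT) { keys[h] = a[i]; counts[h] = 0; }
            counts[h]++;
        }

        /* Collect the distinct (key, count) pairs. */
        int distinct[TABLE_SIZE], cnt[TABLE_SIZE];
        size_t m = 0;
        for (size_t i = 0; i < TABLE_SIZE; i++)
            if (keys[i] != EMPTY_SLOT) { distinct[m] = keys[i]; cnt[m] = counts[i]; m++; }

        /* Insertion-sort the small distinct-key list (at most ~100 entries). */
        for (size_t i = 1; i < m; i++) {
            int k = distinct[i], c = cnt[i];
            size_t j = i;
            while (j > 0 && distinct[j - 1] > k) {
                distinct[j] = distinct[j - 1];
                cnt[j] = cnt[j - 1];
                j--;
            }
            distinct[j] = k;
            cnt[j] = c;
        }

        /* Pass 2: expand the sorted pairs back into the original array. */
        size_t out = 0;
        for (size_t i = 0; i < m; i++)
            for (int c = 0; c < cnt[i]; c++)
                a[out++] = distinct[i];
    }

With only 48-100 distinct keys per array, the insertion sort touches a tiny list, so the two linear passes over the 1176-element array dominate the cost.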

answered Sep 30 '22 by Oleksi