I am sorting 10+ million uint64_t
s with RGB data from .RAW files and 79% of my C program time is spent in qsort
. I am looking for a faster sort for this specific data type.
Being RAW graphical data, the numbers are very random and ~80% unique. No partial sorting or runs of sorted data can be expected. The 4 uint16_t
s inside the uint64_t
are R, G, B and zero (possibly a small count <= ~20).
I have the simplest comparison function I can think of using unsigned long long
s (you CANNOT just subtract them):
qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64);
...
int comp_uint64(const void *a, const void *b) {
if(*((uint64_t *)a) > *((uint64_t *)b)) return(+1);
if(*((uint64_t *)a) < *((uint64_t *)b)) return(-1);
return(0);
} // End Comp_uint64().
There was a very interesting "Programming Puzzles & Code Golf" on StackExchange, but they used float
s. Then there are QSort, RecQuick, heap, stooge, tree, radix...
The swenson/sort looked interesting but had no (obvious) support for my datatype, uint64_t
. And the "quick sort" time was the best. Some sources say the system qsort
can be anything, not necessarily "Quick Sort".
A C++ sort bypasses the generic casting of void pointers and realizes great improvements in performance over C. There has to be an optimized method to slam U8s through a 64bit processor at warp speed.
System/compiler info:
I am currently using the GCC with Strawberry Perl
gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com
Intel 2700K Sandy Bridge CPU, 32GB DDR3
windows 7/64 pro
gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -o c:/bin/stddev.g6.exe
qsort
, QSORT()
!Tried to use Michael Tokarev's inline qsort
.
"READY-TO-USE"? From qsort.h
documentation
-----------------------------
* Several ready-to-use examples:
*
* Sorting array of integers:
* void int_qsort(int *arr, unsigned n) {
* #define int_lt(a,b) ((*a)<(*b))
* QSORT(int, arr, n, int_lt);
--------------------------------
Change from type "int" to "uint64_t"
compile error on TYPE???
c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
QSORT(uint64_t, hpidx, num_pix, islt);
I can't find a real, compiling, working example program, just comments with the "general concept"
#define QSORT_TYPE uint64_t
#define islt(a,b) ((*a)<(*b))
uint64_t *QSORT_BASE;
int QSORT_NELT;
hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t)); // Hash . PIDX
QSORT_BASE = hpidx;
QSORT_NELT = num_pix; // QSORT_LT is function QSORT_LT()
QSORT(uint64_t, hpidx, num_pix, islt);
//QSORT(uint64_t *, hpidx, num_pix, QSORT_LT); // QSORT_LT mal-defined?
//qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS
The "ready-to-use" examples use types of int
, char *
and struct elt
. Isn't uint64_t
a type?? Try long long
QSORT(long long, hpidx, num_pix, islt);
c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
QSORT(long long, hpidx, num_pix, islt);
RADIXSORT
:Results: RADIX_SORT is RADICAL!
I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort
<< 1.40 sec average
4) Time=1.411 sec = 49.61%, Event RADIX_SORT , hits=1
4) Time=1.396 sec = 49.13%, Event RADIX_SORT , hits=1
4) Time=1.392 sec = 49.15%, Event RADIX_SORT , hits=1
16) Time=1.414 sec = 49.12%, Event RADIX_SORT , hits=1
I:\br3\pf.249465>grep "Event" bb11.log | grep -i Sort
<< 5.525 sec average = 3.95 time slower
4) Time=5.538 sec = 86.34%, Event QSort , hits=1
4) Time=5.519 sec = 79.41%, Event QSort , hits=1
4) Time=5.519 sec = 79.02%, Event QSort , hits=1
4) Time=5.563 sec = 79.49%, Event QSort , hits=1
4) Time=5.684 sec = 79.83%, Event QSort , hits=1
4) Time=5.509 sec = 79.30%, Event QSort , hits=1
3.94 times faster than whatever sort qsort
out of the box uses!
And, even more importantly, there was actual, working code, not just 80% of what you need given by some Guru who assumes you know everything they know and can fill in the other 20%.
Fantastic solution! Thanks Louis Ricci!
I would use Radix Sort with an 8bit radix. For 64bit values a well optimized radix sort will have to iterate over the list 9 times (one to precalculate the counts and offsets and 8 for 64bits/8bits). 9*N time and 2*N space (using a shadow array).
Here's what an optimized radix sort would look like.
typedef union {
struct {
uint32_t c8[256];
uint32_t c7[256];
uint32_t c6[256];
uint32_t c5[256];
uint32_t c4[256];
uint32_t c3[256];
uint32_t c2[256];
uint32_t c1[256];
};
uint32_t counts[256 * 8];
} rscounts_t;
uint64_t * radixSort(uint64_t * array, uint32_t size) {
rscounts_t counts;
memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
uint32_t x;
// calculate counts
for(x = 0; x < size; x++) {
t8 = array[x] & 0xff;
t7 = (array[x] >> 8) & 0xff;
t6 = (array[x] >> 16) & 0xff;
t5 = (array[x] >> 24) & 0xff;
t4 = (array[x] >> 32) & 0xff;
t3 = (array[x] >> 40) & 0xff;
t2 = (array[x] >> 48) & 0xff;
t1 = (array[x] >> 56) & 0xff;
counts.c8[t8]++;
counts.c7[t7]++;
counts.c6[t6]++;
counts.c5[t5]++;
counts.c4[t4]++;
counts.c3[t3]++;
counts.c2[t2]++;
counts.c1[t1]++;
}
// convert counts to offsets
for(x = 0; x < 256; x++) {
t8 = o8 + counts.c8[x];
t7 = o7 + counts.c7[x];
t6 = o6 + counts.c6[x];
t5 = o5 + counts.c5[x];
t4 = o4 + counts.c4[x];
t3 = o3 + counts.c3[x];
t2 = o2 + counts.c2[x];
t1 = o1 + counts.c1[x];
counts.c8[x] = o8;
counts.c7[x] = o7;
counts.c6[x] = o6;
counts.c5[x] = o5;
counts.c4[x] = o4;
counts.c3[x] = o3;
counts.c2[x] = o2;
counts.c1[x] = o1;
o8 = t8;
o7 = t7;
o6 = t6;
o5 = t5;
o4 = t4;
o3 = t3;
o2 = t2;
o1 = t1;
}
// radix
for(x = 0; x < size; x++) {
t8 = array[x] & 0xff;
cpy[counts.c8[t8]] = array[x];
counts.c8[t8]++;
}
for(x = 0; x < size; x++) {
t7 = (cpy[x] >> 8) & 0xff;
array[counts.c7[t7]] = cpy[x];
counts.c7[t7]++;
}
for(x = 0; x < size; x++) {
t6 = (array[x] >> 16) & 0xff;
cpy[counts.c6[t6]] = array[x];
counts.c6[t6]++;
}
for(x = 0; x < size; x++) {
t5 = (cpy[x] >> 24) & 0xff;
array[counts.c5[t5]] = cpy[x];
counts.c5[t5]++;
}
for(x = 0; x < size; x++) {
t4 = (array[x] >> 32) & 0xff;
cpy[counts.c4[t4]] = array[x];
counts.c4[t4]++;
}
for(x = 0; x < size; x++) {
t3 = (cpy[x] >> 40) & 0xff;
array[counts.c3[t3]] = cpy[x];
counts.c3[t3]++;
}
for(x = 0; x < size; x++) {
t2 = (array[x] >> 48) & 0xff;
cpy[counts.c2[t2]] = array[x];
counts.c2[t2]++;
}
for(x = 0; x < size; x++) {
t1 = (cpy[x] >> 56) & 0xff;
array[counts.c1[t1]] = cpy[x];
counts.c1[t1]++;
}
free(cpy);
return array;
}
EDIT this implementation was based on a JavaScript version Fastest way to sort 32bit signed integer arrays in JavaScript?
Here's the IDEONE for the C radix sort http://ideone.com/JHI0d9
I see a few options, roughly in order of easiest to hardest.
-flto
switch. This may get the compiler to inline your comparison function. It's too easy not to try.std::sort
. I know your code is in C, but you can make a small module that only sorts and provides a C interface. You're already using a toolchain that has great C++ support.qsort
.Note you can also do your comparison with a single branch instead of two. Just find out which is bigger, then subtract.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With