Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

Question

I am sorting 10+ million uint64_ts with RGB data from .RAW files and 79% of my C program time is spent in qsort. I am looking for a faster sort for this specific data type.

Being RAW graphical data, the numbers are very random and ~80% unique. No partial sorting or runs of sorted data can be expected. The 4 uint16_ts inside the uint64_t are R, G, B and zero (possibly a small count <= ~20).

I have the simplest comparison function I can think of using unsigned long longs (you CANNOT just subtract them):

qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); 
...
int comp_uint64(const void *a, const void *b)  {
    if(*((uint64_t *)a) > *((uint64_t *)b))  return(+1);
    if(*((uint64_t *)a) < *((uint64_t *)b))  return(-1);
    return(0);
}  // End Comp_uint64().

There was a very interesting "Programming Puzzles & Code Golf" on StackExchange, but they used floats. Then there are QSort, RecQuick, heap, stooge, tree, radix...

The swenson/sort looked interesting but had no (obvious) support for my datatype, uint64_t. And the "quick sort" time was the best. Some sources say the system qsort can be anything, not necessarily "Quick Sort".

A C++ sort bypasses the generic casting of void pointers and realizes great improvements in performance over C. There has to be an optimized method to slam U8s through a 64bit processor at warp speed.

System/compiler info:

I am currently using the GCC with Strawberry Perl

gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com
Intel 2700K Sandy Bridge CPU, 32GB DDR3
windows 7/64 pro

gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -o c:/bin/stddev.g6.exe

First attempt at a better `qsort`, `QSORT()`!

Tried to use Michael Tokarev's inline qsort.

"READY-TO-USE"? From qsort.h documentation

-----------------------------
* Several ready-to-use examples:
 *
 * Sorting array of integers:
 * void int_qsort(int *arr, unsigned n) {
 * #define int_lt(a,b) ((*a)<(*b))
 *   QSORT(int, arr, n, int_lt);
--------------------------------

Change from type "int" to "uint64_t"
compile error on TYPE???
    
    c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
      QSORT(uint64_t, hpidx, num_pix, islt);

I can't find a real, compiling, working example program, just comments with the "general concept"

#define QSORT_TYPE uint64_t 
#define islt(a,b) ((*a)<(*b))

uint64_t *QSORT_BASE; 
int QSORT_NELT;

hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t));  // Hash . PIDX
QSORT_BASE = hpidx;
QSORT_NELT = num_pix;  // QSORT_LT is function QSORT_LT()
QSORT(uint64_t, hpidx, num_pix, islt);  
//QSORT(uint64_t *, hpidx, num_pix, QSORT_LT);  // QSORT_LT mal-defined?
//qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS

The "ready-to-use" examples use types of int, char * and struct elt. Isn't uint64_t a type?? Try long long

QSORT(long long, hpidx, num_pix, islt); 
c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
 QSORT(long long, hpidx, num_pix, islt);

Next attempt: `RADIXSORT`:

Results: RADIX_SORT is RADICAL!

  I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort       
 << 1.40 sec average
4) Time=1.411 sec    = 49.61%, Event RADIX_SORT        , hits=1
4) Time=1.396 sec    = 49.13%, Event RADIX_SORT        , hits=1
4) Time=1.392 sec    = 49.15%, Event RADIX_SORT        , hits=1
16) Time=1.414 sec    = 49.12%, Event RADIX_SORT        , hits=1

I:\br3\pf.249465>grep "Event" bb11.log | grep -i Sort 
 << 5.525 sec average  = 3.95 time slower
4) Time=5.538 sec    = 86.34%, Event QSort             , hits=1
4) Time=5.519 sec    = 79.41%, Event QSort             , hits=1
4) Time=5.519 sec    = 79.02%, Event QSort             , hits=1
4) Time=5.563 sec    = 79.49%, Event QSort             , hits=1
4) Time=5.684 sec    = 79.83%, Event QSort             , hits=1
4) Time=5.509 sec    = 79.30%, Event QSort             , hits=1

3.94 times faster than whatever sort qsort out of the box uses!

And, even more importantly, there was actual, working code, not just 80% of what you need given by some Guru who assumes you know everything they know and can fill in the other 20%.

Fantastic solution! Thanks Louis Ricci!

Louis Ricci · Accepted Answer

I would use Radix Sort with an 8bit radix. For 64bit values a well optimized radix sort will have to iterate over the list 9 times (one to precalculate the counts and offsets and 8 for 64bits/8bits). 9*N time and 2*N space (using a shadow array).

Here's what an optimized radix sort would look like.

typedef union {
    struct {
        uint32_t c8[256];
        uint32_t c7[256];
        uint32_t c6[256];
        uint32_t c5[256];
        uint32_t c4[256];
        uint32_t c3[256];
        uint32_t c2[256];
        uint32_t c1[256];
    };
    uint32_t counts[256 * 8];
} rscounts_t;

uint64_t * radixSort(uint64_t * array, uint32_t size) {
    rscounts_t counts;
    memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
    uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
    uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
    uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
    uint32_t x;
    // calculate counts
    for(x = 0; x < size; x++) {
        t8 = array[x] & 0xff;
        t7 = (array[x] >> 8) & 0xff;
        t6 = (array[x] >> 16) & 0xff;
        t5 = (array[x] >> 24) & 0xff;
        t4 = (array[x] >> 32) & 0xff;
        t3 = (array[x] >> 40) & 0xff;
        t2 = (array[x] >> 48) & 0xff;
        t1 = (array[x] >> 56) & 0xff;
        counts.c8[t8]++;
        counts.c7[t7]++;
        counts.c6[t6]++;
        counts.c5[t5]++;
        counts.c4[t4]++;
        counts.c3[t3]++;
        counts.c2[t2]++;
        counts.c1[t1]++;
    }
    // convert counts to offsets
    for(x = 0; x < 256; x++) {
        t8 = o8 + counts.c8[x];
        t7 = o7 + counts.c7[x];
        t6 = o6 + counts.c6[x];
        t5 = o5 + counts.c5[x];
        t4 = o4 + counts.c4[x];
        t3 = o3 + counts.c3[x];
        t2 = o2 + counts.c2[x];
        t1 = o1 + counts.c1[x];
        counts.c8[x] = o8;
        counts.c7[x] = o7;
        counts.c6[x] = o6;
        counts.c5[x] = o5;
        counts.c4[x] = o4;
        counts.c3[x] = o3;
        counts.c2[x] = o2;
        counts.c1[x] = o1;
        o8 = t8; 
        o7 = t7; 
        o6 = t6; 
        o5 = t5; 
        o4 = t4; 
        o3 = t3; 
        o2 = t2; 
        o1 = t1;
    }
    // radix
    for(x = 0; x < size; x++) {
        t8 = array[x] & 0xff;
        cpy[counts.c8[t8]] = array[x];
        counts.c8[t8]++;
    }
    for(x = 0; x < size; x++) {
        t7 = (cpy[x] >> 8) & 0xff;
        array[counts.c7[t7]] = cpy[x];
        counts.c7[t7]++;
    }
    for(x = 0; x < size; x++) {
        t6 = (array[x] >> 16) & 0xff;
        cpy[counts.c6[t6]] = array[x];
        counts.c6[t6]++;
    }
    for(x = 0; x < size; x++) {
        t5 = (cpy[x] >> 24) & 0xff;
        array[counts.c5[t5]] = cpy[x];
        counts.c5[t5]++;
    }
    for(x = 0; x < size; x++) {
        t4 = (array[x] >> 32) & 0xff;
        cpy[counts.c4[t4]] = array[x];
        counts.c4[t4]++;
    }
    for(x = 0; x < size; x++) {
        t3 = (cpy[x] >> 40) & 0xff;
        array[counts.c3[t3]] = cpy[x];
        counts.c3[t3]++;
    }
    for(x = 0; x < size; x++) {
        t2 = (array[x] >> 48) & 0xff;
        cpy[counts.c2[t2]] = array[x];
        counts.c2[t2]++;
    }
    for(x = 0; x < size; x++) {
        t1 = (cpy[x] >> 56) & 0xff;
        array[counts.c1[t1]] = cpy[x];
        counts.c1[t1]++;
    }
    free(cpy);
    return array;
}

EDIT this implementation was based on a JavaScript version Fastest way to sort 32bit signed integer arrays in JavaScript?

Here's the IDEONE for the C radix sort http://ideone.com/JHI0d9

Adam · Answer

I see a few options, roughly in order of easiest to hardest.

Enable link-time optimization with the -flto switch. This may get the compiler to inline your comparison function. It's too easy not to try.
If LTO has no effect, you can use an inline qsort implementation like Michael Tokarev's inline qsort. This page suggests a 2x improvement, again solely due to the compiler's ability to inline the comparison function.
Use the C++ std::sort. I know your code is in C, but you can make a small module that only sorts and provides a C interface. You're already using a toolchain that has great C++ support.
Try swenson/sort's library. It implements many algorithms so you can use the one that works best on your data. It appears to be inlineable, and they claim to be faster than qsort.
Find another sorting library. Something that can do Louis' Radix Sort is a good suggestion.

Note you can also do your comparison with a single branch instead of two. Just find out which is bigger, then subtract.

Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

Tags:

performance

c

algorithm

sorting

First attempt at a better `qsort`, `QSORT()`!

Next attempt: `RADIXSORT`:

BrianP007

2 Answers

Louis Ricci

Adam

Recent Activity

Donate For Us

Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

Tags:

performance

c

algorithm

sorting

First attempt at a better qsort, QSORT()!

Next attempt: RADIXSORT:

BrianP007

2 Answers

Louis Ricci

Adam

Related questions

Recent Activity

Donate For Us

First attempt at a better `qsort`, `QSORT()`!

Next attempt: `RADIXSORT`: