Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C - Fastest way to sort a large 2D integer array

I have a 2D array, where each row contains 6 integers that are already sorted in ascending order. Example:

1  2  3  4  5  6
6  8  9 10 13 15
1  4  5  6  7  9
1  4  5  6  7  8
3 18 19 20 25 34

Expected output:

1  2  3  4  5  6
1  4  5  6  7  8
1  4  5  6  7  9
3 18 19 20 25 34
6  8  9 10 13 15

The actual data contains anywhere from 8m to 33m records like this. I'm trying to determine the fastest way to sort this array. I currently have some working code using qsort:

qsort call:

qsort(allRecords, lineCount, sizeof(int*), cmpfunc);

cmpfunc:

int cmpfunc (const void * a, const void * b)
{
  const int *rowA = *(const int **)a;
  const int *rowB = *(const int **)b;

  if (rowA[0] > rowB[0]) return 1;
  if (rowA[0] < rowB[0]) return -1;

  if (rowA[1] > rowB[1]) return 1;
  if (rowA[1] < rowB[1]) return -1;

  if (rowA[2] > rowB[2]) return 1;
  if (rowA[2] < rowB[2]) return -1;

  if (rowA[3] > rowB[3]) return 1;
  if (rowA[3] < rowB[3]) return -1;

  if (rowA[4] > rowB[4]) return 1;
  if (rowA[4] < rowB[4]) return -1;

  if (rowA[5] > rowB[5]) return 1;
  if (rowA[5] < rowB[5]) return -1;

  return 0;
}

For the sample 33 million records, it takes about 35.6 seconds (gcc -O1), which is pretty fast, but I'm wondering if there's any faster way to do it given the pre-sorted values in each row.

This naturally lends to data where the most common first digit is 1, so in a 33m record file, there might be 12m records starting with 1, then 8m records starting with 2, 5m records starting with 3, etc... I'm not sure if this would lend itself to one particular type of sorting over another (e.g. heapsort).

My understand is that qsort has a fair amount of overhead because of all the times it has to call the function, so I'm hoping for some even faster performance.

I'm not usually writing C code, so I'm very open to suggestions and criticism, since I'm piecing this all together from tutorials and other StackOverflow questions/answers.

EDIT: As requested, my initialization code:

// Empty record
int recArray[6] = {0,0,0,0,0,0};

// Initialize allRecords
int** allRecords;
allRecords = (int**) malloc(lineCount*sizeof(int*));
for(i=0; i < lineCount; i++)
{
    allRecords[i] = (int*) malloc(6*sizeof(int));
}

// Zero-out all records
for(i=0; i < lineCount; i++)
{
  memcpy(allRecords[i], recArray, 6 * sizeof(int));
}

I'm still learning the right way to do stuff, so I wouldn't be surprised if I was doing it all wrong. Guidance in doing it right would be appreciated.

Someone else asked about the range of values - I'm not sure if the range will change in the future, but at the current moment, the values are between 1 and 99.

Also, for profiling - I built a small function that uses gettimeofday() to pull seconds/microseconds, and then compare before and after. I'm open to better ways. The output looks like:

// <-- Here I capture the gettimeofday() structure output
Sorting...
Sorted.
Time Taken: 35.628882s // <-- Capture it again, show the difference

EDIT: Per @doynax - I now "pack" the 6 values of each line into an unsigned long long int:

// Initialize allRecords
unsigned long long int* allRecords;
allRecords = (unsigned long long int*) malloc(lineCount*sizeof(unsigned long long int));
for(i=0; i < lineCount; i++)
{
    allRecords[i] = 0;
}

...

// "Pack" current value (n0) into an unsigned long long int
if(recPos == 0) { lineSum += n0 * UINT64_C(1); }
else if(recPos == 1) { lineSum += n0 * UINT64_C(100); }
else if(recPos == 2) { lineSum += n0 * UINT64_C(10000); }
else if(recPos == 3) { lineSum += n0 * UINT64_C(1000000); }
else if(recPos == 4) { lineSum += n0 * UINT64_C(100000000); }
else if(recPos == 5) { lineSum += n0 * UINT64_C(10000000000); }
...
allRecords[linecount] = lineSum;
lineSum = 0;

I can also later "unpack" one of these unsigned long long int values successfully back into the original 6 ints.

However, when I try to sort:

qsort(allRecords, lineCount, sizeof(unsigned long long int), cmpfunc);

...

int cmpfunc (const void * a, const void * b)
{
  if (*(unsigned long long int*)a > *(unsigned long long int*)b) return 1;
  if (*(unsigned long long int*)a < *(unsigned long long int*)b) return -1;
  return 0;
}

...the results aren't sorted as expected. If I show the first and last lines before and after sorting using this:

printf("[%i] = %llu = %i,%i,%i,%i,%i,%i\n", j, lineSum, recArray[0]...recArray[5]);

The output is:

First and last 5 rows before sorting:
[#] = PACKED INT64 = UNPACKED
[0] = 462220191706 = 6,17,19,20,22,46
[1] = 494140341005 = 5,10,34,40,41,49
[2] = 575337201905 = 5,19,20,37,53,57
[3] = 504236262316 = 16,23,26,36,42,50
[4] = 534730201912 = 12,19,20,30,47,53
[46] = 595648302516 = 16,25,30,48,56,59
[47] = 453635251108 = 8,11,25,35,36,45
[48] = 403221161202 = 2,12,16,21,32,40
[49] = 443736310604 = 4,6,31,36,37,44
[50] = 575248312821 = 21,28,31,48,52,57

First and last 5 rows after sorting:
[0] = 403221161202 = 2,12,16,21,32,40
[1] = 413218141002 = 2,10,14,18,32,41
[2] = 443736310604 = 4,6,31,36,37,44
[3] = 444127211604 = 4,16,21,27,41,44
[4] = 453028070302 = 2,3,7,28,30,45
[46] = 585043260907 = 7,9,26,43,50,58
[47] = 593524170902 = 2,9,17,24,35,59
[48] = 595248392711 = 11,27,39,48,52,59
[49] = 595251272612 = 12,26,27,51,52,59
[50] = 595648302516 = 16,25,30,48,56,59

I'm guessing I'm somehow comparing the wrong values (e.g. pointer values instead of the actual values), but I'm not quite sure what the right syntax is.

On the plus side, it's blazing fast this way. :)

Sorting 33m 64-bit ints takes about 4-5 seconds (at least in its current, mistaken form).

like image 794
jhilgeman Avatar asked Apr 18 '17 18:04

jhilgeman


2 Answers

I can't resist a good optimization challenge.

Initial Thoughts

My first thought in seeing the problem was to use a radix sort, specifically a radix sort with an internal counting sort. (And it's nice to see that some of the comments agreed with me, too!) I've used radix sort with internal counting sort in other projects, and while quicksort can outperform it for small datasets, it leaves quicksort in the dust for large datasets.

I went for radix sort because comparison sorts like quicksort have a well-known performance limit of O(n lg n). Because n is pretty big here, that "lg n" part is also pretty big — for 33 million records, you can reasonably expect it will take "some constant time" times 33000000 times lg(33000000), and lg(33000000) is about 25. Sorts like radix sort and counting sort are "non-comparing" sorts, so they can cut under that O(n lg n) boundary by having special knowledge of the data — in this case, they can go much faster because we know the data is small integers.

I used the same mash-the-values-together code that everybody else did, combining each set of int values into a single BigInt, but I went for bit-shifts and masking rather than multiplication and modulus, since bit operations tend to be faster overall.

My implementation (see below) performs some constant-time setup computations, and then it iterates over the data exactly 7 times to sort it — and "7 times" not only beats the pants off "25 times" algorithmically, but my implementation is careful to avoid touching the memory more often than necessary so that it plays as well as it can with CPU caches too.

For reasonable source data, I wrote a little program that generated 33 million records that looked vaguely like your originals. Each row is in sorted order, and slightly biased toward smaller values. (You can find my sixsample.c sample-dataset program below.)

Performance Comparison

Using something very close to your qsort() implementation, I got these numbers over three runs, and they aren't too far off from what you were getting (different CPU, obviously):

qsort:

Got 33000000 lines. Sorting.
Sorted lines in 5 seconds and 288636 microseconds.
Sorted lines in 5 seconds and 415553 microseconds.
Sorted lines in 5 seconds and 242454 microseconds.

Using my radix-sort-with-counting-sort implementation, I got these numbers over three runs:

Radix sort with internal counting sort, and cache-optimized:

Got 33000000 lines. Sorting.
Sorted lines in 0 seconds and 749285 microseconds.
Sorted lines in 0 seconds and 833474 microseconds.
Sorted lines in 0 seconds and 761943 microseconds.

5.318 seconds vs 0.781 seconds to sort the same 33 million records. That's 6.8x faster!

Just to be sure, I diffed the output from each run to guarantee the results were the same — you really can sort tens of millions of records in under a second!

Deeper Dive

The basic principles of a radix sort and a counting sort are described well on Wikipedia, and in one of my favorite books, Introduction to Algorithms, so I won't repeat the basics here. Instead, I'll describe a bit about how my implementation differs from the textbook forms to run a bit faster.

The normal implementation of these algorithms is something like this (in Python-esque pseudocode):

def radix-sort(records):
    temp = []
    for column = 0 to columns:
         counting-sort(records, temp, column)
         copy 'temp' back into 'records'

def counting-sort(records, dest, column):

    # Count up how many of each value there is at this column.
    counts = []
    foreach record in records:
        counts[record[column]] += 1

    transform 'counts' into offsets

    copy from 'records' to 'dest' using the calculated offsets

This certainly works, but it comes with a few uglinesses. The "count up the values" step involves counting values in the whole dataset for every column. The "copy temp back into records" step involves copying the whole dataset for every column column. And those are in addition to the required step of "copy from records-to-dest in sorted order". We spin over the data three times for every column; it would be nice to do that less often!

My implementation mixes these together: Rather than calculating the counts for each column separately, they're calculated together as a first pass over the data. Then the counts are transformed in bulk into offsets; and finally, the data is sorted into place. In addition, rather than copying from temp to records at each stage, I simply page-flip: The data alternates where it's "coming from" every other iteration. My pseudocode is something more like this:

def radix-sort(records):

    # Count up how many of each value there is in every column,
    # reading each record only once.
    counts = [,]
    foreach record in records:
        foreach column in columns:
            counts[column, record[column]] += 1

    transform 'counts' for each column into offsets for each column

    temp = []
    for column = 0 to columns step 2:
         sort-by-counts(records, temp, column)
         sort-by-counts(temp, records, column+1)

def sort-by-counts(records, dest, counts, column):
    copy from 'records' to 'dest' using the calculated offsets

This avoids spinning over the data any more than necessary: One pass for setup, and six passes to sort it, all in order. You can't make the CPU cache much happier.

The C Code

Here's my complete implementation of sixsort.c. It includes your qsort solution as well, commented-out, so it's easier for you to compare the two. It includes lots of documentation, and all of the I/O and metrics code, so it's somewhat long, but its completeness should help you to understand the solution better:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <sys/time.h>

/* Configuration. */
#define NUM_COLUMNS 6       /* How many columns to read/sort */
#define MAX_VALUE 100       /* Maximum value allowed in each column, plus one */
#define BITSHIFT 7      /* How far to shift bits when combining values */

/* Note: BITSHIFT should be as small as possible, but MAX_VALUE must be <= 2**BITSHIFT. */

/* Highest value representable under the given BITSHIFT. */
#define BITMASK ((1 << BITSHIFT) - 1)

/* The type we're going to pack the integers into.  The highest bits will get the
   leftmost number, while the lowest bits will get the rightmost number. */
typedef unsigned long long BigInt;

/*-------------------------------------------------------------------------------------------------
** Radix Sorting.
*/

/* Move a set of items from src to dest, using the provided counts to decide where each
   item belongs.  src and dest must not be the same arrays. */
static void SortByCounts(BigInt *src, BigInt *dest, size_t *counts, size_t totalCount, int bitshift)
{
    BigInt *temp, *end;

    for (temp = src, end = src + totalCount; temp < end; temp++) {
        BigInt value = *temp;
        int number = (int)((value >> bitshift) & BITMASK);
        int offset = counts[number]++;
        dest[offset] = value;
    }
}

/* Use a radix sort with an internal counting sort to sort the given array of values.
   This iterates over the source data exactly 7 times, which for large
   'count' (i.e., count > ~2000) is typically much more efficient than O(n lg n)
   algorithms like quicksort or mergesort or heapsort.  (That said, the recursive O(n lg n)
   algorithms typically have better memory locality, so which solution wins overall
   may vary depending on your CPU and memory system.) */
static void RadixSortWithInternalCountingSort(BigInt *values, size_t count)
{
    size_t i, j;
    BigInt *temp, *end;
    size_t counts[NUM_COLUMNS][MAX_VALUE];
    size_t oldCount;

    /* Reset the counts of each value in each column to zero, quickly.
       This takes MAX_VALUE * NUM_COLUMNS time (i.e., constant time). */
    memset(counts, 0, sizeof(counts));

    /* Count up how many there are of each value in this column.  This iterates over
       the whole dataset exactly once, processing each set of values in full, so it
       takes COUNT * NUM_COLUMNS operations, or theta(n). */
    for (temp = values, end = values + count; temp < end; temp++) {
        BigInt value = *temp;
        for (i = 0; i < NUM_COLUMNS; i++) {
            counts[i][(int)((value >> (i * BITSHIFT)) & BITMASK)]++;
        }
    }

    /* Transform the counts into offsets.  This only transforms the counts array,
       so it takes MAX_VALUE * NUM_COLUMNS operations (i.e., constant time). */
    size_t totals[NUM_COLUMNS];
    for (i = 0; i < NUM_COLUMNS; i++) {
        totals[i] = 0;
    }
    for (i = 0; i < MAX_VALUE; i++) {
        for (j = 0; j < NUM_COLUMNS; j++) {
            oldCount = counts[j][i];
            counts[j][i] = totals[j];
            totals[j] += oldCount;
        }
    }

    temp = malloc(sizeof(BigInt) * count);

    /* Now perform the actual sorting, using the counts to tell us how to move
       the items.  Each call below iterates over the whole dataset exactly once,
       so this takes COUNT * NUM_COLUMNS operations, or theta(n). */
    for (i = 0; i < NUM_COLUMNS; i += 2) {
        SortByCounts(values, temp, counts[i  ], count,  i    * BITSHIFT);
        SortByCounts(temp, values, counts[i+1], count, (i+1) * BITSHIFT);
    }

    free(temp);
}

/*-------------------------------------------------------------------------------------------------
** Built-in Quicksorting.
*/

static int BigIntCompare(const void *a, const void *b)
{
    BigInt av = *(BigInt *)a;
    BigInt bv = *(BigInt *)b;

    if (av < bv) return -1;
    if (av > bv) return +1;
    return 0;
}

static void BuiltInQuicksort(BigInt *values, size_t count)
{
    qsort(values, count, sizeof(BigInt), BigIntCompare);
}

/*-------------------------------------------------------------------------------------------------
** File reading.
*/

/* Read a single integer from the given string, skipping initial whitespace, and
   store it in 'returnValue'.  Returns a pointer to the end of the integer text, or
   NULL if no value can be read. */
static char *ReadInt(char *src, int *returnValue)
{
    char ch;
    int value;

    /* Skip whitespace. */
    while ((ch = *src) <= 32 && ch != '\r' && ch != '\n') src++;

    /* Do we have a valid digit? */
    if ((ch = *src++) < '0' || ch > '9')
        return NULL;

    /* Collect digits into a number. */
    value = 0;
    do {
        value *= 10;
        value += ch - '0';
    } while ((ch = *src++) >= '0' && ch <= '9');
    src--;

    /* Return what we did. */
    *returnValue = value;
    return src;
}

/* Read a single line of values from the input into 'line', and return the number of
   values that were read on this line. */
static int ReadLine(FILE *fp, BigInt *line)
{
    int numValues;
    char buffer[1024];
    char *src;
    int value;
    BigInt result = 0;

    if (fgets(buffer, 1024, fp) == NULL) return 0;
    buffer[1023] = '\0';

    numValues = 0;
    src = buffer;
    while ((src = ReadInt(src, &value)) != NULL) {
        result |= ((BigInt)value << ((NUM_COLUMNS - ++numValues) * BITSHIFT));
    }

    *line = result;
    return numValues;
}

/* Read from file 'fp', which should consist of a sequence of lines with
   six decimal integers on each, and write the parsed, packed integer values
   to a newly-allocated 'values' array.  Returns the number of lines read. */
static size_t ReadInputFile(FILE *fp, BigInt **values)
{
    BigInt line;
    BigInt *dest;
    size_t count, max;

    count = 0;
    dest = malloc(sizeof(BigInt) * (max = 256));

    while (ReadLine(fp, &line)) {
        if (count >= max) {
            size_t newmax = max * 2;
            BigInt *temp = malloc(sizeof(BigInt) * newmax);
            memcpy(temp, dest, sizeof(BigInt) * count);
            free(dest);
            max = newmax;
            dest = temp;
        }
        dest[count++] = line;
    }

    *values = dest;
    return count;
}

/*-------------------------------------------------------------------------------------------------
** File writing.
*/

/* Given a number from 0 to 999 (inclusive), write its string representation to 'dest'
   as fast as possible.  Returns 'dest' incremented by the number of digits written. */
static char *WriteNumber(char *dest, unsigned int number)
{
    if (number >= 100) {
        dest += 3;
        dest[-1] = '0' + number % 10, number /= 10;
        dest[-2] = '0' + number % 10, number /= 10;
        dest[-3] = '0' + number % 10;
    }
    else if (number >= 10) {
        dest += 2;
        dest[-1] = '0' + number % 10, number /= 10;
        dest[-2] = '0' + number % 10;
    }
    else {
        dest += 1;
        dest[-1] = '0' + number;
    }

    return dest;
}

/* Write a single "value" (one line of content) to the output file. */
static void WriteOutputLine(FILE *fp, BigInt value)
{
    char buffer[1024];
    char *dest = buffer;
    int i;

    for (i = 0; i < NUM_COLUMNS; i++) {
        if (i > 0)
            *dest++ = ' ';
        int number = (value >> (BITSHIFT * (NUM_COLUMNS - i - 1))) & BITMASK;
        dest = WriteNumber(dest, (unsigned int)number);
    }
    *dest++ = '\n';
    *dest = '\0';

    fwrite(buffer, 1, dest - buffer, fp);
}

/* Write the entire output file as a sequence of values in columns. */
static void WriteOutputFile(FILE *fp, BigInt *values, size_t count)
{
    while (count-- > 0) {
        WriteOutputLine(fp, *values++);
    }
}

/*-------------------------------------------------------------------------------------------------
** Timeval support.
*/

int timeval_subtract(struct timeval *result, struct timeval *x, struct timeval *y)  
{  
    /* Perform the carry for the later subtraction by updating y. */  
    if (x->tv_usec < y->tv_usec) {  
        int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;  
        y->tv_usec -= 1000000 * nsec;  
        y->tv_sec += nsec;  
    }  
    if (x->tv_usec - y->tv_usec > 1000000) {  
        int nsec = (y->tv_usec - x->tv_usec) / 1000000;  
        y->tv_usec += 1000000 * nsec;  
        y->tv_sec -= nsec;  
    }  

    /* Compute the time remaining to wait. tv_usec is certainly positive. */  
    result->tv_sec = x->tv_sec - y->tv_sec;  
    result->tv_usec = x->tv_usec - y->tv_usec;  

    /* Return 1 if result is negative. */  
    return x->tv_sec < y->tv_sec;  
}

/*-------------------------------------------------------------------------------------------------
** Main.
*/

int main(int argc, char **argv)
{
    BigInt *values;
    size_t count;
    FILE *fp;
    struct timeval startTime, endTime, deltaTime;

    if (argc != 3) {
        fprintf(stderr, "Usage: sixsort input.txt output.txt\n");
        return -1;
    }

    printf("Reading %s...\n", argv[1]);

    if ((fp = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "Unable to open \"%s\" for reading.\n", argv[1]);
        return -1;
    }
    count = ReadInputFile(fp, &values);
    fclose(fp);

    printf("Got %d lines. Sorting.\n", (int)count);

    gettimeofday(&startTime, NULL);
    RadixSortWithInternalCountingSort(values, count);
    /*BuiltInQuicksort(values, count);*/
    gettimeofday(&endTime, NULL);
    timeval_subtract(&deltaTime, &endTime, &startTime);

    printf("Sorted lines in %d seconds and %d microseconds.\n",
        (int)deltaTime.tv_sec, (int)deltaTime.tv_usec);

    printf("Writing %d lines to %s.\n", (int)count, argv[2]);

    if ((fp = fopen(argv[2], "w")) == NULL) {
        fprintf(stderr, "Unable to open \"%s\" for writing.\n", argv[2]);
        return -1;
    }
    WriteOutputFile(fp, values, count);
    fclose(fp);

    free(values);
    return 0;
}

Also, for reference, here's my crappy little sample-dataset producer sixsample.c, using a predictable LCG random-number generator so it always generates the same sample data:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <math.h>

#define NUM_COLUMNS 6
#define MAX_VALUE 35 

unsigned int RandSeed = 0;

/* Generate a random number from 0 to 65535, inclusive, using a simple (fast) LCG.
   The numbers below are the same as used in the ISO/IEC 9899 proposal for C99/C11. */
static int Rand()
{
    RandSeed = RandSeed * 1103515245 + 12345;
    return (int)(RandSeed >> 16);
}

static int CompareInts(const void *a, const void *b)
{
    return *(int *)a - *(int *)b;
}

static void PrintSortedRandomLine()
{
    int values[NUM_COLUMNS];
    int i;

    for (i = 0; i < NUM_COLUMNS; i++) {
        values[i] = 1 + (int)(pow((double)Rand() / 65536.0, 2.0) * MAX_VALUE);
    }
    qsort(values, NUM_COLUMNS, sizeof(int), CompareInts);

    /* Lame. */
    printf("%d %d %d %d %d %d\n",
        values[0], values[1], values[2], values[3], values[4], values[5]);
}

int main(int argc, char **argv)
{
    unsigned int numLines;
    unsigned int i;

    if (argc < 2 || argc > 3) {
        fprintf(stderr, "Usage: sixsample numLines [randSeed]\n");
        return -1;
    }

    numLines = atoi(argv[1]);
    if (argc > 2)
        RandSeed = atoi(argv[2]);

    for (i = 0; i < numLines; i++) {
        PrintSortedRandomLine();
    }

    return 0;
}

And finally, here's a Makefile that can build them the same way I did, on Ubuntu 16, with gcc 5.4.0:

all: sixsort sixsample

sixsort: sixsort.c
    gcc -O6 -Wall -o sixsort sixsort.c

sixsample: sixsample.c
    gcc -O -Wall -o sixsample sixsample.c

clean:
    rm -f sixsort sixsample

Further Possible Optimizations

I considered further optimizing some parts of this by writing them in pure assembly, but gcc does an excellent job of inlining the functions, unrolling the loops, and generally hitting this with most of the optimizations I would've otherwise considered doing myself in assembly, so it would be difficult to improve it that way. The core operations don't lend themselves very well to SIMD instructions, so gcc's version is likely as fast as it gets.

This would probably run faster if I compiled it as a 64-bit process, since the CPU would be able to hold each complete record in a single register, but my test Ubuntu VM was 32-bit, so c'est la vie.

Given that the values themselves are all fairly small, you might be able to improve my solution one further by combining code points: So instead of performing six counting sorts, one subsort on each of the seven-bit columns, you could divide up the same 42 bits in other ways: You could perform three subsorts on each of the 14-bit columns. This would require storing 3*2^14 (=3*16384=49152) counts and offsets in memory instead of 6*2^7 (=6*128=768), but since memory is fairly cheap, and the data would still fit in the CPU cache, it might be worth trying. And since the original data is two-digit decimal integers, you could even slice it down the middle and perform a pair of subsorts on each of two 6-decimal-digit columns of values (requiring 2*10^6=2000000 counts and offsets, which still might fit in your CPU cache, depending on your CPU).

You could also parallelize it! My implementation lights up one CPU hard, but leaves the rest idle. So another technique to make this faster yet could be to divide the source data into chunks, one per CPU core, sort each chunk independently, and then perform a mergesort-like 'merge' operation at the end to combine the results together. If you find yourself running this computation a lot, parallelizing the work might be something to look into.

Conclusion and Other Thoughts

Although I didn't test it myself, it might be fair to compare the performance of a hand-written quicksort, since qsort() includes function-call overhead when comparing values. I suspect a hand-written solution would beat the 5.3 seconds of qsort(), but it probably wouldn't compete against the radix sort, both due to quicksort's nonlinear growth and its less-favorable CPU-cache characteristics (it would read each value more often than 7 times on average, so it would probably cost more to pull the data from RAM).

Also, the cost of the sorting is now dwarfed by the cost of the I/O to read the text file into memory and write the result back out again. Even with my custom parsing code, it still takes several seconds to read the 33 million records, a fraction of a second to sort them, and a few seconds to write them back out again. If your data is already in RAM, that might not matter, but if it's on a disk somewhere, you'll need to start finding ways to optimize that too.

But as the goal of the question was to optimize the sorting itself, sorting 33 million records in only three-quarters of a second isn't too shabby, I think.

like image 76
Sean Werkema Avatar answered Oct 18 '22 02:10

Sean Werkema


Jonathan Leffler's comment about reordering the packing is spot on and I had the same thought when looking over your code. The following would be my approach:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h> // for memcpy

#define ROW_LENGTH      6
#define ROW_COUNT       15

// 1, 2, 3, 4, 5, 6, 6, 8, 9, 10, 13, 15, 1, 4, 5, 6, 7, 9, 1, 4, 5, 6, 7, 8, 3, 18, 19, 20, 25, 34

/*
    1  2  3  4  5  6
    6  8  9 10 13 15
    1  4  5  6  7  9
    1  4  5  6  7  8
    3 18 19 20 25 34
*/

// Insertion sorting taken from https://stackoverflow.com/a/2789530/2694511  with modification

static __inline__ int sortUlliArray(unsigned long long int *d, int length){
        int i, j;
        for (i = 1; i < length; i++) {
                unsigned long long int tmp = d[i];
                for (j = i; j >= 1 && tmp < d[j-1]; j--)
                        d[j] = d[j-1];
                d[j] = tmp;
        }

        return i; // just to shutup compiler
}


int cmpfunc (const void * a, const void * b)
{
  if (*(unsigned long long int*)a > *(unsigned long long int*)b) return 1;
  if (*(unsigned long long int*)a < *(unsigned long long int*)b) return -1;
  return 0;
}

int main(){
    int array[ROW_COUNT][ROW_LENGTH],
        decodedResultsArray[ROW_COUNT][ROW_LENGTH];

    const int rawData[] = {     1, 2, 3, 4, 5, 6,
                                6, 8, 9, 10, 13, 15,
                                1, 4, 5, 6, 7, 9,
                                1, 4, 5, 6, 7, 8,
                                3, 18, 19, 20, 25, 34,
                                6,17,19,20,22,46,
                                5,10,34,40,41,49,
                                5,19,20,37,53,57,
                                16,23,26,36,42,50,
                                12,19,20,30,47,53,
                                16,25,30,48,56,59,
                                8,11,25,35,36,45,
                                2,12,16,21,32,40,
                                4,6,31,36,37,44,
                                21,28,31,48,52,57
                    };

    memcpy(array, rawData, sizeof(rawData)/sizeof(*rawData)); // copy elements into array memory

    // Sort
    // precompute keys
    unsigned long long int *rowSums = calloc(ROW_COUNT, sizeof(unsigned long long int));
    unsigned long long int *sortedSums = rowSums ? calloc(ROW_COUNT, sizeof(unsigned long long int)) : NULL; // if rowSums is null, don't bother trying to allocate.
    if(!rowSums || !sortedSums){
        free(rowSums);
        free(sortedSums);
        fprintf(stderr, "Failed to allocate memory!\n");
        fflush(stderr); // should be unnecessary, but better to make sure it gets printed
        exit(100);
    }

    int i=0, j=0, k=0;
    for(; i < ROW_COUNT; i++){
        rowSums[i] = 0; // this should be handled by calloc, but adding this for debug
        for(j=0; j < ROW_LENGTH; j++){
            unsigned long long int iScalar=1;
            for(k=ROW_LENGTH-1; k > j; --k)
                iScalar *= 100; // probably not the most efficient way to compute this, but this is meant more as an example/proof of concept

            unsigned long long int iHere = array[i][j];
            rowSums[i] += (iHere * iScalar);

            // printf("DEBUG ITERATION REPORT\n\t\tRow #%d\n\t\tColumn #%d\n\t\tiScalar: %llu\n\t\tiHere: %llu\n\t\tCurrent Sum for Row: %llu\n\n", i, j, iScalar, iHere, rowSums[i]);
            fflush(stdout);
        }
    }

    memcpy(sortedSums, rowSums, sizeof(unsigned long long int)*ROW_COUNT);

    // Some debugging output:
    /*

    printf("Uncopied Sums:\n");
    for(i=0; i < ROW_COUNT; i++)
        printf("SortedRowSums[%d] = %llu\n", i, rowSums[i]);

    printf("Memcopyed sort array:\n");
    for(i=0; i < ROW_COUNT; i++)
        printf("SortedRowSums[%d] = %llu\n", i, sortedSums[i]);

    */

    clock_t begin = clock();

    //qsort(sortedSums, ROW_COUNT, sizeof(unsigned long long int), cmpfunc);
    sortUlliArray(sortedSums, ROW_COUNT);

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

    printf("Time for sort: %lf\n", time_spent);
    printf("Before sort array:\n");
    for(i=0; i<ROW_COUNT; i++){
        for(j=0; j < ROW_LENGTH; j++){
            printf("Unsorted[%d][%d] = %d\n", i, j, array[i][j]);
        }
    }

    printf("Values of sorted computed keys:\n");
    for(i=0; i < ROW_COUNT; i++)
        printf("SortedRowSums[%d] = %llu\n", i, sortedSums[i]);

    // Unpack:
    for(i=0; i < ROW_COUNT; i++){
        for(j=0; j < ROW_LENGTH; j++){
            unsigned long long int iScalar=1;
            for(k=ROW_LENGTH-1; k > j; --k)
                iScalar *= 100;

            unsigned long long int removalAmount = sortedSums[i]/iScalar;

            decodedResultsArray[i][j] = removalAmount;
            sortedSums[i] -= (removalAmount*iScalar);
            // DEBUG:
            // printf("Decoded Result for decoded[%d][%d] = %d\n", i, j, decodedResultsArray[i][j]);
        }
    }

    printf("\nFinal Output:\n");
    for(i=0; i < ROW_COUNT; i++){
        printf("Row #%d: %d", i, decodedResultsArray[i][0]);
        for(j=1; j < ROW_LENGTH; j++){
            printf(", %d", decodedResultsArray[i][j]);
        }
        puts("");
    }
    fflush(stdout);

    free(rowSums);
    free(sortedSums);

    return 1;
}

Do note, this is not all optimized for the maximum efficiency and it is littered with debug output statements, but nonetheless, it's a proof of concept on how the packing can work. Also, given the number of rows you have to handle, you will probably be better to use qsort(), but I have it using sortUlliArray(...) (which is a modified version of the Insert-sort function from this StackOverflow answer). You'll have to give it a test to see what performs best for your case.

All in all, the final output from running this code on the 15 hardcoded rows is:

Row #0: 1, 2, 3, 4, 5, 6
Row #1: 1, 4, 5, 6, 7, 8
Row #2: 1, 4, 5, 6, 7, 9
Row #3: 2, 12, 16, 21, 32, 40
Row #4: 3, 18, 19, 20, 25, 34
Row #5: 4, 6, 31, 36, 37, 44
Row #6: 5, 10, 34, 40, 41, 49
Row #7: 5, 19, 20, 37, 53, 57
Row #8: 6, 8, 9, 10, 13, 15
Row #9: 6, 17, 19, 20, 22, 46
Row #10: 8, 11, 25, 35, 36, 45
Row #11: 12, 19, 20, 30, 47, 53
Row #12: 16, 23, 26, 36, 42, 50
Row #13: 16, 25, 30, 48, 56, 59
Row #14: 21, 28, 31, 48, 52, 57

So, this does appear to handle the cases where the numbers are very similar, which was an issue attributable to the order the numbers were packed in.

Anyway, the code above should work, but it is meant as an example, so I will leave it to you to apply necessary optimizations.

Code was tested on a MacBook Air 64-bit with 1.6 GHz Intel Core i5 and 4 GB 1600 MHz DDR3. So, a pretty weak CPU and slow memory, but it was able to do the sort for the 15 rows 0.004 milliseconds, so fairly fast, in my opinion. (That is just a measure of the sort function's speed for the above test case, not for the pre-packing or unpacking speeds, as those could use some optimization.)

Major credit goes to Doynax and Jonathan Leffler.

like image 34
Spencer D Avatar answered Oct 18 '22 02:10

Spencer D