Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallel radix sort, how would this implementation actually work? Are there some heuristics?

I am working on an Udacity quiz for their parallel programming course. I am pretty stuck on how I should start on the assignment because I am not sure if I understand it correctly.

For the assignment (in code) we are given two arrays and array on values and an array of positions. We are supposed to sort the array of values with a parallelized radix sort, along with setting the positions correctly too.

I completely understand radix sort and how it works. What I don't understand is how they want us to implemented it. Here is the template given to start the assignment

//Udacity HW 4
//Radix Sorting

#include "reference_calc.cpp"
#include "utils.h"

/* Red Eye Removal
   ===============

   For this assignment we are implementing red eye removal.  This is
   accomplished by first creating a score for every pixel that tells us how
   likely it is to be a red eye pixel.  We have already done this for you - you
   are receiving the scores and need to sort them in ascending order so that we
   know which pixels to alter to remove the red eye.

   Note: ascending order == smallest to largest

   Each score is associated with a position, when you sort the scores, you must
   also move the positions accordingly.

   Implementing Parallel Radix Sort with CUDA
   ==========================================

   The basic idea is to construct a histogram on each pass of how many of each
   "digit" there are.   Then we scan this histogram so that we know where to put
   the output of each digit.  For example, the first 1 must come after all the
   0s so we have to know how many 0s there are to be able to start moving 1s
   into the correct position.

   1) Histogram of the number of occurrences of each digit
   2) Exclusive Prefix Sum of Histogram
   3) Determine relative offset of each digit
        For example [0 0 1 1 0 0 1]
                ->  [0 1 0 1 2 3 2]
   4) Combine the results of steps 2 & 3 to determine the final
      output location for each element and move it there

   LSB Radix sort is an out-of-place sort and you will need to ping-pong values
   between the input and output buffers we have provided.  Make sure the final
   sorted results end up in the output buffer!  Hint: You may need to do a copy
   at the end.

 */


void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{

}

I specifically don't understand how those 4 steps end up sorting the array.

So for the first step, I am supposed to create a histogram of the "digits" (why is that in quotes..?). So given a input value n I need to make a count of the 0's and 1's into a histogram. So, should step 1 create an array of histograms, one for each input value?

And well, for the rest of the steps it breaks down pretty quickly. Could someone show me how these steps are supposed to implement a radix sort?

like image 367
KDecker Avatar asked Oct 05 '14 19:10

KDecker


People also ask

How is a radix sort algorithm implemented?

The Radix sort algorithm works by ordering each digit from least significant to most significant. In base 10, radix sort would sort by the digits in the one's place, then the ten's place, and so on. To sort the values in each digit place, Radix sort employs counting sort as a subroutine.

Which technique is used to implement radix sort?

Radix sort is a non-comparative sorting algorithm that is used to sorts the data in lexicographical (dictionary) order. It uses counting sort as a subroutine, to sort an array of integer digit by digity and array of strings character by character.

When should we use radix sort for sorting purpose?

Hence, radix sort is a stable sorting algorithm. (A sorting algorithm is stable if, for any two elements with the same key, the relative order of the two elements is preserved in the sorted output as it is in the original input.)

Is radix sort parallel?

The idea of the parallel Radix Sort is, for each key digit, to first sort locally the data on each processor according to the digit value. All the processors do that concurrently.


2 Answers

The basic idea behind a radix sort is that we will consider each element to be sorted digit by digit, from least significant to most significant. For each digit, we will move the elements so that those digits are in increasing order.

Let's take a really simple example. Let's sort four quantities, each of which have 4 binary digits. Let's choose 1, 4, 7, and 14. We'll mix them up and also visualize the binary representation:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001

First we will consider bit 0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

Now the radix sort algorithm says we must move the elements in such a way that (considering only bit 0) all the zeroes are on the left, and all the ones are on the right. Let's do this while preserving the order of the elements with a zero bit and preserving the order of the elements with a one bit. We could do that like this:

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      1110    0100    0111    0001
bit 0:       0       0       1       1

The first step of our radix sort is complete. The next step is to consider the next (binary) digit:

Element #    3       2       1       4
Value:       4       14      7       1
Binary:      0100    1110    0111    0001
bit 1:       0       1       1       0

Once again, we must move elements so that the digit in question (bit 1) is arranged in ascending order:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 1:       0       0       1       1

Now we must move to the next higher digit:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 2:       1       0       1       1

And move them again:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 2:       0       1       1       1

Now we move to the last (highest order) digit:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 3:       0       0       1       0

And make our final move:

Element #    4       3       1       2
Value:       1       4       7       14
Binary:      0001    0100    0111    1110
bit 3:       0       0       0       1

And the values are now sorted. This hopefully seems clear, but in the description so far we've glossed over the details of things like "how do we know which elements to move?" and "how do we know where to put them?" So let's repeat our example, but we'll use the specific methods and sequence suggested in the prompt, in order to answer these questions. Starting over with bit 0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

First let's build a histogram of the number of zero bits in bit 0 position, and the number of 1 bits in bit 0 position:

bit 0:       1       0       0       1

              zero bits       one bits
              ---------       --------
1)histogram:         2              2

Now let's do an exclusive prefix-sum on these histogram values:

              zero bits       one bits
              ---------       --------
1)histogram:         2              2
2)prefix sum:        0              2

An exclusive prefix-sum is just the sum of all preceding values. There are no preceding values in the first position, and in the second position the preceding value is 2 (the number of elements with a 0 bit in bit 0 position). Now, as an independent operation, let's determine the relative offset of each 0 bit amongst all the zero bits, and each one bit amongst all the one bits:

bit 0:       1       0       0       1
3)offset:    0       0       1       1

This can actually be done programmatically using exclusive prefix-sums again, considering the 0-group and 1-group separately, and treating each position as if it has a value of 1:

0 bit 0:             1       1       
3)ex. psum:          0       1    

1 bit 0:     1                        1      
3)ex. psum:  0                        1   

Now, step 4 of the given algorithm says:

4) Combine the results of steps 2 & 3 to determine the final output location for each element and move it there

What this means is, for each element, we will select the histogram-bin prefix sum value corresponding to its bit value (0 or 1) and add to that, the offset associated with its position, to determine the location to move that element to:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1
hist psum:   2       0       0       2
offset:      0       0       1       1
new index:   2       0       1       3

Moving each element to its "new index" position, we have:

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      0111    1110    0111    0001

Which is exactly the result we expect for the completion of our first digit-move, based on the previous walk-through. This has completed step 1, i.e. the first (least-significant) digit; we still have the remaining digits to process, creating a new histogram and new prefix sums at each step.

Notes:

  1. Radix-sort, even in a computer, does not have to be done based strictly on binary digits. It's possible to construct a similar algorithm with digits of different sizes, perhaps consisting of 2,3, or 4 bits.
  2. One of the optimizations we can perform on a radix sort is to only sort based on the number of digits that are actually meaningful. For example, if we are storing quantities in 32-bit values, but we know that the largest quantity present is 1023 (2^10-1), we need not sort on all 32 bits. We can stop, expecting a proper sort, after proceeding through the first 10 bits.
  3. What does any of this have to do with GPUs? In so far as the above description goes, not much. The practical application is to consider using parallel algorithms for things like the histogram, the prefix-sums, and the data movement. This decomposition of radix-sort allows one to locate and use parallel algorithms already developed for these more basic operations, in order to construct a fast parallel sort.

What follows is a worked example. This may help with your understanding of radix sort. I don't think it will help with your assignment, because this example performs a 32-bit radix sort at the warp level, for a single warp, ie. for 32 quantities. But a possible advantage from an understanding point of view is that things like histogramming and prefix sums can be done at the warp level in just a few instructions, taking advantage of various CUDA intrinsics. For your assignment, you won't be able to use these techniques, and you will need to come up with full-featured parallel prefix sums, histograms, etc. that can operate on an arbitrary dataset size.

#include <stdio.h>
#include <stdlib.h>
#define WSIZE 32
#define LOOPS 100000
#define UPPER_BIT 31
#define LOWER_BIT 0

__device__ unsigned int ddata[WSIZE];

// naive warp-level bitwise radix sort

__global__ void mykernel(){
  __shared__ volatile unsigned int sdata[WSIZE*2];
  // load from global into shared variable
  sdata[threadIdx.x] = ddata[threadIdx.x];
  unsigned int bitmask = 1<<LOWER_BIT;
  unsigned int offset  = 0;
  unsigned int thrmask = 0xFFFFFFFFU << threadIdx.x;
  unsigned int mypos;
  //  for each LSB to MSB
  for (int i = LOWER_BIT; i <= UPPER_BIT; i++){
    unsigned int mydata = sdata[((WSIZE-1)-threadIdx.x)+offset];
    unsigned int mybit  = mydata&bitmask;
    // get population of ones and zeroes (cc 2.0 ballot)
    unsigned int ones = __ballot(mybit); // cc 2.0
    unsigned int zeroes = ~ones;
    offset ^= WSIZE; // switch ping-pong buffers
    // do zeroes, then ones
    if (!mybit) // threads with a zero bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes&thrmask);
    else        // threads with a one bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes)+__popc(ones&thrmask);
    // move to buffer  (or use shfl for cc 3.0)
    sdata[mypos-1+offset] = mydata;
    // repeat for next bit
    bitmask <<= 1;
    }
  // save results to global
  ddata[threadIdx.x] = sdata[threadIdx.x+offset];
  }

int main(){

  unsigned int hdata[WSIZE];
  for (int lcount = 0; lcount < LOOPS; lcount++){
    unsigned int range = 1U<<UPPER_BIT;
    for (int i = 0; i < WSIZE; i++) hdata[i] = rand()%range;
    cudaMemcpyToSymbol(ddata, hdata, WSIZE*sizeof(unsigned int));
    mykernel<<<1, WSIZE>>>();
    cudaMemcpyFromSymbol(hdata, ddata, WSIZE*sizeof(unsigned int));
    for (int i = 0; i < WSIZE-1; i++) if (hdata[i] > hdata[i+1]) {printf("sort error at loop %d, hdata[%d] = %d, hdata[%d] = %d\n", lcount,i, hdata[i],i+1, hdata[i+1]); return 1;}
    // printf("sorted data:\n");
    //for (int i = 0; i < WSIZE; i++) printf("%u\n", hdata[i]);
    }
  printf("Success!\n");
  return 0;
}
like image 68
Robert Crovella Avatar answered Oct 16 '22 09:10

Robert Crovella


The methodology that @Robert Crovella gives is absolutely correct and very helpful. It is mildly different than the process that they explain in the Udacity videos. I'll record one iteration of their method, watchable here, in this answer, jumping off from Robert Crovella's example:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
LSB:         1       0       0       1

Predicate:   0     __1__   __1__     0
Pred. Scan:  0     __0__   __1__     2

Number of ones in predicate: 2

!Predicate:__1__     0       0     __1__
!Pred. Scan: 0       1       1       1

Offset for !Pred. Scan = Number of ones in predicate = 2

!Pred. Scan + Offset:
           __2__     3       3     __3__

Final indexes to move values after 1 iteration (on LSB):
             2       0       1       3

Values after 1 iteration (on LSB):
             14      4       7       1  

I placed emphasis (__ __) on the values that indicate or contain the index to move the value to.

Terms (from Udacity video):

  • LSB = least significant bit
  • Predicate (for LSB): (x & 1) == 0
    • for the next significant bit: (x & 2) == 0
    • for the one after that: (x & 4) == 0
    • and so on, with more left shifting (<<)
  • Pred. Scan = Predicate Scan = Predicate exclusive prefix sum
  • !Pred. = bits of predicate flipped (0->1 and 1->0)
  • Number of ones in predicate
    • note that this is not necessarily the last entry in the scan, you can instead get this value (sum/reduction of the predicate) as an intermediate of the Blelloch scan

A summary of the above is:

  1. Get the predicate of your list (bit in common, starting from the LSB)
  2. Scan the predicate, and record the sum of the predicate in the process
    • Blelloch Scan on the GPU
    • note that your predicate will be of arbitrary size, so read the section on Blelloch Scan for arrays of arbitrary instead of 2^n size
  3. Flip bits of the predicate, and scan that
  4. Move the values in your array with the following rule:
    • For the ith element in the array:
    • if the ith predicate is TRUE, move the ith value to the index in the ith element of the predicate scan
    • else, move the ith value to the index in the ith element of the !Predicate scan plus the sum of the Predicate
  5. Move to the next significant bit (NSB)

For reference, you can consult my solution for this HW assignment in CUDA.

like image 36
polpetti Avatar answered Oct 16 '22 09:10

polpetti