 

Sorting algorithm with Cuda. Inside or outside kernels?

I have a matrix of size 50000x100 and I need to sort each row using Cuda in C++. My architecture is a K80 NVidia card.

Since the number of columns is small, I am currently running the sorting algorithm inside a kernel. I am using a modified bubble algorithm that runs on all lines of the matrix.

I am wondering if there is a more efficient way to proceed. I tried to use thrust::sort inside my kernel, but it is much slower. I also tried a merge sort algorithm, but the recursive part of the algorithm didn't work inside my kernel.

==edit==

here is my kernel:

__global__ void computeQuantilesKernel(float *matIn, int nRows, int nCols, int nQuantiles, float *outsideValues, float *quantilesAve, int param2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float values[100];//big enough for 100 columns
    int keys[100];
    int nQuant[100];//big enough for 100 quantiles (percentiles)
    float thisQuantile[100];
    int quant;

    if (idx >= nRows) return;

    //read matIn from global memory
    for (int i = 0; i < nCols; i++)
    {
        values[i] = matIn[idx * nCols + i + param2 * nCols * nRows];
        keys[i] = i;
    }

    //bubble Sort:
    int i, j;
    int temp;
    float tempVal;

    for (i = 0; i < nCols - 1; i++)
    {
        for (j = 0; j < nCols - i - 1; j++)
        {
            if (values[j + 1] < values[j])      // ascending order; use > for descending
            {
                tempVal = values[j];            // swap values and keys together
                temp = keys[j];
                values[j] = values[j + 1];
                keys[j] = keys[j + 1];
                values[j + 1] = tempVal;
                keys[j + 1] = temp;
            }
        }
    }
    //end of bubble sort

    //reset nQuant and thisQuantile
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        nQuant[iQuant] = 0;
        thisQuantile[iQuant] = 0;
    }

    //Compute sum of outsideValues for each quantile
    for (int i = 0; i < nCols; i++)
    {
        quant = (int)(((float)i + 0.5) / ((float)nCols / (float)nQuantiles));//quantile like Matlab
        nQuant[quant]++;
        thisQuantile[quant] += outsideValues[idx * nCols + keys[i]];
    }

    //Divide by the size of each quantile to get averages
    for (int iQuant = 0; iQuant < nQuantiles; iQuant++)
    {
        quantilesAve[idx + nRows * iQuant + param2 * nQuantiles * nRows] = thisQuantile[iQuant] / (float)nQuant[iQuant];
    }
}
Philippe Huber asked Mar 06 '17
1 Answer

Your code as it stands uses a single thread to handle each of your rows separately. As a result you are starved for fast scratch memory (registers, L1 cache, shared memory): you are allocating at least 1600 bytes per thread, which is a lot. You want to stay at around 128 bytes per thread (32 registers of 32 bits each). Secondly, you are using local arrays indexed at run time; those arrays will be spilled into local memory, trash your L1 cache and end up in global memory again (1600 B × 32 threads gives 51 KB per warp, which is already at or above the limits of shared memory/L1).

For that reason I would suggest handling a single row per block of 64 or 128 threads instead, and keep the row you sort in shared memory. Bubble sort is actually very easy to implement in parallel:

constexpr int MAX_COLS = 128;          // compile-time bound, assuming nCols <= MAX_COLS
__shared__ float values[MAX_COLS];     // shared memory size must be known at compile time
... load the data ...
__syncthreads();
int j = threadIdx.x;
for (int i = 0; i < nCols/2; i++)
{
    if (j % 2 == 0 && j < nCols - 1)
        if (values[j+1] < values[j])
        {
            float t = values[j];       // manual swap; there is no device-side swap() by default
            values[j] = values[j+1];
            values[j+1] = t;
        }
    __syncthreads();
    if (j % 2 == 1 && j < nCols - 1)
        if (values[j+1] < values[j])
        {
            float t = values[j];
            values[j] = values[j+1];
            values[j+1] = t;
        }
    __syncthreads();
}

Notice how your inner for j = ... loop is replaced by threadIdx.x, but the core idea of the algorithm stays the same: in each iteration I perform the bubble swap first only on even pairs and then only on odd pairs, so neighboring threads never write to the same element.

I assume that nCols is no larger than your block dimension, which for 100 elements is easily achievable.

There are many ways that the above code can be improved further, for example

  • Cut the thread count in half and assume j=threadIdx.x*2 for the first half of the loop, and j=threadIdx.x*2+1 for the second half. This way no thread stays idle.
  • Use only 32 threads, each handling two values of j sequentially. This way your problem will fit a single warp, allowing you to drop __syncthreads() altogether. With 32 threads, you might be able to use warp shuffle intrinsics.
  • Experiment with #pragma unroll, although the amount of produced code may become unmanageable. Profiling will help.

Also consider experimenting with a hardcoded merge sort instead of bubble sort. If my memory serves me right, when I implemented a warp-sized bubble sort and merge sort with all loops unrolled, merge sort performed almost twice as fast as bubble sort. Note that this was several years ago, on the first generation of CUDA-capable cards.

CygnusX1 answered Nov 09 '22