 

Optimise byte operations CUDA

I am relatively new to CUDA and I am trying to write a kernel which calculates the sum of absolute differences between a query vector and a large database of vectors. The elements of both must be 8-bit unsigned ints. I've based my kernel on NVIDIA's sample parallel reduction kernel, and I've also read this thread.

I am only getting about 5GB/s, which is not much better than a fast CPU and does not even come close to the theoretical bandwidth of my GDDR5 GT640 of about 80GB/s.

My data set consists of a 1024-byte query vector and a 100,000 x 1024-byte database.

I have 100,000 blocks of 128 threads. If each block accesses the same 1024-byte query_vector, is that going to hurt performance, since every block is accessing the same memory location?

blockSize and the shared memory size are set to 128 and 128*sizeof(int) respectively; 128 is #define'd as THREADS_PER_BLOCK.

template<UINT blockSize> __global__ void reduction_sum_abs( BYTE* query_vector, BYTE* db_vector, uint32_t* result )
{
    extern __shared__ UINT sum[]; 
    UINT db_linear_index = (blockIdx.y*gridDim.x) + blockIdx.x ; 
    UINT i = threadIdx.x; 

    sum[threadIdx.x] = 0; 

    int* p_q_int = reinterpret_cast<int*>(query_vector); 
    int* p_db_int = reinterpret_cast<int*>(db_vector); 

    while( i < VECTOR_SIZE/4 ) {

        /* memory transaction */
        int q_int = p_q_int[i]; 
        int db_int = p_db_int[db_linear_index*VECTOR_SIZE/4 + i]; 

        uchar4 a0 = *reinterpret_cast<uchar4*>(&q_int); 
        uchar4 b0 = *reinterpret_cast<uchar4*>(&db_int); 

        /* sum of absolute difference */ 
        sum[threadIdx.x] += abs( (int)a0.x - b0.x ); 
        sum[threadIdx.x] += abs( (int)a0.y - b0.y ); 
        sum[threadIdx.x] += abs( (int)a0.z - b0.z ); 
        sum[threadIdx.x] += abs( (int)a0.w - b0.w ); 

        i += THREADS_PER_BLOCK; 

    }

    __syncthreads(); 

    if ( blockSize >= 128 ) {
        if ( threadIdx.x < 64 ) { 
            sum[threadIdx.x] += sum[threadIdx.x + 64]; 
        }
    }

    /* reduce the final warp */
    if ( threadIdx.x < 32 ) {        
        if ( blockSize >= 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 32]; } __syncthreads(); 

        if ( blockSize >= 32 ) { sum[threadIdx.x] += sum[threadIdx.x + 16]; } __syncthreads(); 

        if ( blockSize >= 16 ) { sum[threadIdx.x] += sum[threadIdx.x + 8 ]; } __syncthreads(); 

        if ( blockSize >= 8  ) { sum[threadIdx.x] += sum[threadIdx.x + 4 ]; } __syncthreads(); 

        if ( blockSize >= 4  ) { sum[threadIdx.x] += sum[threadIdx.x + 2 ]; } __syncthreads(); 

        if ( blockSize >= 2  ) { sum[threadIdx.x] += sum[threadIdx.x + 1 ]; } __syncthreads(); 

    }


    /* copy the sum back to global */
    if ( threadIdx.x == 0 ) {
        result[db_linear_index] = sum[0]; 
    }
}

I can get about a 4x bandwidth increase if I run the kernel with the 4 lines of code which do the actual absolute difference calculation commented out. Obviously that produces the wrong answer, but I believe at least a significant portion of the time is spent there.

Is it possible that I am creating bank conflicts with the way I am accessing the bytes? If so, can I avoid the conflicts?

Is my usage of reinterpret_cast correct?

Is there a better method for doing 8 bit unsigned calculations?

What other optimisations can I make? (I would assume many, as I'm a complete novice.)

Thanks

EDIT:

My machine specs are as follows:

Windows XP 2002 SP3

Intel 6600 2.40GHz

2GB RAM

GT640 GDDR5 1GB

Visual C++ 2010 Express

asked Apr 17 '26 22:04 by user3678912


2 Answers

For questions like these it's good practice to provide complete code that someone can compile and run without having to add or change anything; generally speaking, SO expects this. Since your question is about performance, you should also include your actual timing measurement methodology in that complete code.

FIXING ERRORS:

There are at least 2 errors in your code, one of which @Jez has pointed out already. After this "partial reduction" step:

if ( blockSize >= 128 ) {
    if ( threadIdx.x < 64 ) { 
        sum[threadIdx.x] += sum[threadIdx.x + 64]; 
    }
}

we need a __syncthreads(); before proceeding with the remainder. With the above change, I was able to get your kernel to produce repeatable results that matched my naive host implementation. Also, since you have conditional code like this which does not evaluate the same across the threadblock:

if ( threadIdx.x < 32 ) {  

it is illegal to have a __syncthreads() statement inside the conditional code block:

  if ( blockSize >= 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 32]; } __syncthreads(); 

(and likewise for the subsequent lines which do the same thing). So it's recommended to fix that. There are a few ways we can solve this, one of which is to switch to using a volatile typed pointer to refer to the shared data. Since we are now operating within a warp, the volatile qualifier forces the compiler to do what we want:

volatile UINT *vsum = sum;
if ( threadIdx.x < 32 ) {        
    if ( blockSize >= 64 ) vsum[threadIdx.x] += vsum[threadIdx.x + 32];
    if ( blockSize >= 32 ) vsum[threadIdx.x] += vsum[threadIdx.x + 16]; 
    if ( blockSize >= 16 ) vsum[threadIdx.x] += vsum[threadIdx.x + 8 ];
    if ( blockSize >= 8  ) vsum[threadIdx.x] += vsum[threadIdx.x + 4 ];
    if ( blockSize >= 4  ) vsum[threadIdx.x] += vsum[threadIdx.x + 2 ]; 
    if ( blockSize >= 2  ) vsum[threadIdx.x] += vsum[threadIdx.x + 1 ];
}

The CUDA parallel reduction sample code and associated pdf may be a good review for you.

TIMING/PERF ANALYSIS:

I happen to have a GT 640, cc3.5 device. When I run bandwidthTest on it, I observe about 32GB/s for device-to-device transfers. This number is representative of a reasonable approximate upper bound on achievable bandwidth when device kernels are accessing device memory. Also, when I add cudaEvent based timing and build a sample code around what you have shown, with simulated data, I observe a throughput of around 16GB/s, not 5GB/s. So your actual measurement technique would be useful info here (in fact, a complete code is probably what is needed to analyze the differences between my timing of your kernel, and your timing).

The question remains, then: can it be improved (assuming ~32GB/s is the approximate upper bound)?

YOUR QUESTIONS:

Is it possible that I am creating bank conflicts the way I am accessing the bytes? if so can I avoid conflicts?

Since your kernel actually loads the bytes effectively as a 32-bit quantity (uchar4), and each thread is loading an adjacent, consecutive 32-bit quantity, I don't believe there are any bank-conflict access problems with your kernel.

Is my usage of reinterpret_cast correct?

Yes, it appears to be correct (my sample code below, with the above mentioned fixes, validates that results produced by your kernel match a naive host function implementation.)

Is there a better method for doing 8 bit unsigned calculations?

There is, and in this case, as pointed out by @njuffa, the SIMD intrinsics can handle this, as it turns out, with a single instruction (__vsadu4(), see the sample code below).

What other (I would assume many, as I'm a complete novice) optimisations can I make?

  1. Use the cc3.0 warp-shuffle reduction method proposed by @MichalHosala

  2. Make use of the SIMD instrinsic __vsadu4() to simplify and improve the handling of the byte quantities as proposed by @njuffa.

  3. Reorganize your database vector data to be in column-major storage. This allows us to dispense with the ordinary parallel reduction method (even that as mentioned in item 1) and switch to a straight for-loop read kernel, one thread computing an entire vector comparison. This allows our kernel to reach approximately the memory bandwidth of the device in this case (cc3.5 GT640).

Here is the code and a sample run, showing 3 implementations: your original implementation (plus the above named "fixes" to get it to produce correct results), an opt1 kernel that modifies yours to include items 1 and 2 from the list above, and an opt2 kernel that replaces yours with an approach utilizing 2 and 3 from the list above. According to my measurement, your kernel achieves 16GB/s, about half of the bandwidth of a GT640, the opt1 kernel runs at about 24GB/s (the increase coming in approximately equal parts from items 1 and 2 above), and the opt2 kernel, with a data-reorganization, runs at approximately full bandwidth (36GB/s).

$ cat t574.cu
#include <stdio.h>
#include <stdlib.h>
#define THREADS_PER_BLOCK 128
#define VECTOR_SIZE 1024
#define NUM_DB_VEC 100000

typedef unsigned char BYTE;
typedef unsigned int UINT;
typedef unsigned int uint32_t;


template<UINT blockSize> __global__ void reduction_sum_abs( BYTE* query_vector, BYTE* db_vector, uint32_t* result )
{
    extern __shared__ UINT sum[];
    UINT db_linear_index = (blockIdx.y*gridDim.x) + blockIdx.x ;
    UINT i = threadIdx.x;

    sum[threadIdx.x] = 0;

    int* p_q_int = reinterpret_cast<int*>(query_vector);
    int* p_db_int = reinterpret_cast<int*>(db_vector);

    while( i < VECTOR_SIZE/4 ) {

        /* memory transaction */
        int q_int = p_q_int[i];
        int db_int = p_db_int[db_linear_index*VECTOR_SIZE/4 + i];

        uchar4 a0 = *reinterpret_cast<uchar4*>(&q_int);
        uchar4 b0 = *reinterpret_cast<uchar4*>(&db_int);

        /* sum of absolute difference */
        sum[threadIdx.x] += abs( (int)a0.x - b0.x );
        sum[threadIdx.x] += abs( (int)a0.y - b0.y );
        sum[threadIdx.x] += abs( (int)a0.z - b0.z );
        sum[threadIdx.x] += abs( (int)a0.w - b0.w );

        i += THREADS_PER_BLOCK;

    }

    __syncthreads();

    if ( blockSize >= 128 ) {
        if ( threadIdx.x < 64 ) {
            sum[threadIdx.x] += sum[threadIdx.x + 64];
        }
    }
    __syncthreads(); // **
    /* reduce the final warp */
    if ( threadIdx.x < 32 ) {
        if ( blockSize >= 64 ) { sum[threadIdx.x] += sum[threadIdx.x + 32]; } __syncthreads();

        if ( blockSize >= 32 ) { sum[threadIdx.x] += sum[threadIdx.x + 16]; } __syncthreads();

        if ( blockSize >= 16 ) { sum[threadIdx.x] += sum[threadIdx.x + 8 ]; } __syncthreads();

        if ( blockSize >= 8  ) { sum[threadIdx.x] += sum[threadIdx.x + 4 ]; } __syncthreads();

        if ( blockSize >= 4  ) { sum[threadIdx.x] += sum[threadIdx.x + 2 ]; } __syncthreads();

        if ( blockSize >= 2  ) { sum[threadIdx.x] += sum[threadIdx.x + 1 ]; } __syncthreads();

    }


    /* copy the sum back to global */
    if ( threadIdx.x == 0 ) {
        result[db_linear_index] = sum[0];
    }
}

__global__ void reduction_sum_abs_opt1( BYTE* query_vector, BYTE* db_vector, uint32_t* result )
{
  __shared__ UINT sum[THREADS_PER_BLOCK];
  UINT db_linear_index = (blockIdx.y*gridDim.x) + blockIdx.x ;
  UINT i = threadIdx.x;

  sum[threadIdx.x] = 0;

  UINT* p_q_int = reinterpret_cast<UINT*>(query_vector);
  UINT* p_db_int = reinterpret_cast<UINT*>(db_vector);

  while( i < VECTOR_SIZE/4 ) {

    /* memory transaction */
    UINT q_int = p_q_int[i];
    UINT db_int = p_db_int[db_linear_index*VECTOR_SIZE/4 + i];
    sum[threadIdx.x] += __vsadu4(q_int, db_int);

    i += THREADS_PER_BLOCK;

    }
  __syncthreads();
  // this reduction assumes THREADS_PER_BLOCK = 128
  if (threadIdx.x < 64) sum[threadIdx.x] += sum[threadIdx.x+64];
  __syncthreads();

  if ( threadIdx.x < 32 ) {
    unsigned localSum = sum[threadIdx.x] + sum[threadIdx.x + 32];
    for (int i = 16; i >= 1; i /= 2)
      localSum = localSum + __shfl_xor(localSum, i);
    if (threadIdx.x == 0) result[db_linear_index] = localSum;
    }
}

__global__ void reduction_sum_abs_opt2( BYTE* query_vector, UINT* db_vector_cm, uint32_t* result)
{
  __shared__ UINT qv[VECTOR_SIZE/4];
  if (threadIdx.x < VECTOR_SIZE/4) qv[threadIdx.x] = *(reinterpret_cast<UINT *>(query_vector) + threadIdx.x);
  __syncthreads();
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  while (idx < NUM_DB_VEC){
    UINT sum = 0;
    for (int i = 0; i < VECTOR_SIZE/4; i++)
      sum += __vsadu4(qv[i], db_vector_cm[(i*NUM_DB_VEC)+idx]);
    result[idx] = sum;
    idx += gridDim.x*blockDim.x;}
}

unsigned long compute_host_result(BYTE *qvec, BYTE *db_vec){

  unsigned long temp = 0;
  for (int i =0; i < NUM_DB_VEC; i++)
    for (int j = 0; j < VECTOR_SIZE; j++)
      temp += (unsigned long) abs((int)qvec[j] - (int)db_vec[(i*VECTOR_SIZE)+j]);
  return temp;
}

int main(){

  float et;
  cudaEvent_t start, stop;
  BYTE *h_qvec, *d_qvec, *h_db_vec, *d_db_vec;
  uint32_t *h_res, *d_res;
  h_qvec =   (BYTE *)malloc(VECTOR_SIZE*sizeof(BYTE));
  h_db_vec = (BYTE *)malloc(VECTOR_SIZE*NUM_DB_VEC*sizeof(BYTE));
  h_res = (uint32_t *)malloc(NUM_DB_VEC*sizeof(uint32_t));
  for (int i = 0; i < VECTOR_SIZE; i++){
    h_qvec[i] = rand()%256;
    for (int j = 0; j < NUM_DB_VEC; j++) h_db_vec[(j*VECTOR_SIZE)+i] = rand()%256;}
  cudaMalloc(&d_qvec, VECTOR_SIZE*sizeof(BYTE));
  cudaMalloc(&d_db_vec, VECTOR_SIZE*NUM_DB_VEC*sizeof(BYTE));
  cudaMalloc(&d_res, NUM_DB_VEC*sizeof(uint32_t));
  cudaMemcpy(d_qvec, h_qvec, VECTOR_SIZE*sizeof(BYTE), cudaMemcpyHostToDevice);
  cudaMemcpy(d_db_vec, h_db_vec, VECTOR_SIZE*NUM_DB_VEC*sizeof(BYTE), cudaMemcpyHostToDevice);
  cudaEventCreate(&start); cudaEventCreate(&stop);

// initial run

  cudaMemset(d_res, 0, NUM_DB_VEC*sizeof(uint32_t));
  cudaEventRecord(start);
  reduction_sum_abs<THREADS_PER_BLOCK><<<NUM_DB_VEC, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(int)>>>(d_qvec, d_db_vec, d_res);
  cudaEventRecord(stop);
  cudaDeviceSynchronize();
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  cudaMemcpy(h_res, d_res, NUM_DB_VEC*sizeof(uint32_t), cudaMemcpyDeviceToHost);
  unsigned long h_result = 0;
  for (int i = 0; i < NUM_DB_VEC; i++) h_result += h_res[i];
  printf("1: et: %.2fms, bw: %.2fGB/s\n", et, (NUM_DB_VEC*VECTOR_SIZE)/(et*1000000));
  if (h_result == compute_host_result(h_qvec, h_db_vec)) printf("Success!\n");
  else printf("1: mismatch!\n");

// optimized kernel 1
  cudaMemset(d_res, 0, NUM_DB_VEC*sizeof(uint32_t));
  cudaEventRecord(start);
  reduction_sum_abs_opt1<<<NUM_DB_VEC, THREADS_PER_BLOCK>>>(d_qvec, d_db_vec, d_res);
  cudaEventRecord(stop);
  cudaDeviceSynchronize();
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  cudaMemcpy(h_res, d_res, NUM_DB_VEC*sizeof(uint32_t), cudaMemcpyDeviceToHost);
  h_result = 0;
  for (int i = 0; i < NUM_DB_VEC; i++) h_result += h_res[i];
  printf("2: et: %.2fms, bw: %.2fGB/s\n", et, (NUM_DB_VEC*VECTOR_SIZE)/(et*1000000));
  if(h_result == compute_host_result(h_qvec, h_db_vec)) printf("Success!\n");
  else printf("2: mismatch!\n");

// convert db_vec to column-major storage for optimized kernel 2

  UINT *h_db_vec_cm, *d_db_vec_cm;
  h_db_vec_cm = (UINT *)malloc(NUM_DB_VEC*(VECTOR_SIZE/4)*sizeof(UINT));
  cudaMalloc(&d_db_vec_cm, NUM_DB_VEC*(VECTOR_SIZE/4)*sizeof(UINT));
  for (int i = 0; i < NUM_DB_VEC; i++)
    for (int j = 0; j < VECTOR_SIZE/4; j++)
      h_db_vec_cm[(j*NUM_DB_VEC)+i] = *(reinterpret_cast<UINT *>(h_db_vec + (i*VECTOR_SIZE))+j);
  cudaMemcpy(d_db_vec_cm, h_db_vec_cm, NUM_DB_VEC*(VECTOR_SIZE/4)*sizeof(UINT), cudaMemcpyHostToDevice);
  cudaMemset(d_res, 0, NUM_DB_VEC*sizeof(uint32_t));
  cudaEventRecord(start);
  reduction_sum_abs_opt2<<<64, 512>>>(d_qvec, d_db_vec_cm, d_res);
  cudaEventRecord(stop);
  cudaDeviceSynchronize();
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  cudaMemcpy(h_res, d_res, NUM_DB_VEC*sizeof(uint32_t), cudaMemcpyDeviceToHost);
  h_result = 0;
  for (int i = 0; i < NUM_DB_VEC; i++) h_result += h_res[i];
  printf("3: et: %.2fms, bw: %.2fGB/s\n", et, (NUM_DB_VEC*VECTOR_SIZE)/(et*1000000));
  if(h_result == compute_host_result(h_qvec, h_db_vec)) printf("Success!\n");
  else printf("3: mismatch!\n");

  return 0;
}

$ nvcc -O3 -arch=sm_35 -o t574 t574.cu
$ ./run35 t574
1: et: 6.34ms, bw: 16.14GB/s
Success!
2: et: 4.16ms, bw: 24.61GB/s
Success!
3: et: 2.83ms, bw: 36.19GB/s
Success!
$

A few notes:

  1. The above code, in particular your kernel, must be compiled for cc3.0 or higher, the way I have the test case set up. This is because I am creating 100,000 blocks in a single 1D grid, so we cannot run this as-is on a cc2.0 device, for example.
  2. There may be some additional slight tuning possible for the opt2 kernel, especially when run on different devices, by modifying the grid and block parameters. I have these set at 64 and 512, and the values should not be critical (although the block should be VECTOR_SIZE/4 threads or greater), since the algorithm uses a grid-striding loop to cover the entire vector set. A GT640 has only 2 SMs, so in this case 64 threadblocks is enough to keep the device busy (probably even 32 would be OK). You might want to modify these to get maximum performance on larger devices.
answered Apr 20 '26 13:04 by Robert Crovella


One thing got my attention right away:

if ( blockSize >= 128 ) {
    if ( threadIdx.x < 64 ) { 
        sum[threadIdx.x] += sum[threadIdx.x + 64]; 
    }
}

The first condition is true everywhere, while the second holds only in the first two warps. Therefore you could benefit from switching their order:

if ( threadIdx.x < 64 ) {
    if ( blockSize >= 128 ) { 
        sum[threadIdx.x] += sum[threadIdx.x + 64]; 
    }
}

This would allow all warps except for the first two to finish their execution sooner.

Next, you can speed up the warp-level reduction quite significantly using the __shfl_xor instruction:

/* reduce the final warp */
if ( threadIdx.x < 32 ) {
  auto localSum = sum[threadIdx.x] + sum[threadIdx.x + 32]; 
  for (auto i = 16; i >= 1; i /= 2)
  {
      localSum = localSum + __shfl_xor(localSum, i);
  }

  if (threadIdx.x == 0) result[db_linear_index] = localSum;
}

I am not saying these are the only issues with your code, but they are the ones I was able to spot quite easily. I haven't tested the performance of my suggestions, but I believe they should improve it.

Edit: It also seems that you are unnecessarily writing to shared memory four times:

/* sum of absolute difference */ 
sum[threadIdx.x] += abs( (int)a0.x - b0.x ); 
sum[threadIdx.x] += abs( (int)a0.y - b0.y ); 
sum[threadIdx.x] += abs( (int)a0.z - b0.z ); 
sum[threadIdx.x] += abs( (int)a0.w - b0.w ); 

Why not simply do the following?

/* sum of absolute difference */ 
sum[threadIdx.x] += abs( (int)a0.x - b0.x )
                  + abs( (int)a0.y - b0.y )
                  + abs( (int)a0.z - b0.z )
                  + abs( (int)a0.w - b0.w ); 
answered Apr 20 '26 13:04 by Michal Hosala


