 

Cuda Bayer/CFA demosaicing example

I've written a CUDA 4 Bayer demosaicing routine, but it's slower than single-threaded CPU code, running on a 16-core GTS 250.
The block size is (16,16) and the image dimensions are a multiple of 16 - but changing this doesn't improve performance.

Am I doing anything obviously stupid?

--------------- calling routine ------------------
uchar4 *d_output;
size_t num_bytes; 

cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);    
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);

// Do the conversion, leave the result in the PBO for display
kernel_wrapper( imageWidth, imageHeight, blockSize, gridSize, d_output );

cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);

--------------- cuda -------------------------------
texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;

__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    uint i = __umul24(y, width) + x;

    // input is GR/BG output is BGRA
    if ((x < width) && (y < height)) {

        if ( y & 0x01 ) {
            if ( x & 0x01 ) {  
                d_output[i].x =  (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2;  // B                
                d_output[i].y = (tex2D(tex,x,y));     // G in B
                d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2;  // R                    
            } else {
                d_output[i].x = (tex2D(tex,x,y));        //B
                d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;  // G
                d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4;   // R
            }
        } else {
            if ( x & 0x01 ) {
                 // odd col = R
                d_output[i].x = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4;  // B
                d_output[i].z = (tex2D(tex,x,y));        //R
                d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4;  // G    
            } else {    
                d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2;  // B
                d_output[i].y = (tex2D(tex,x,y));               // G  in R               
                d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2;  // R                    
            }
        }                                
    }
}



void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{

    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) ); 
    uint size = imageWidth * imageHeight * sizeof(uchar);
    cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );
    cutFree(imagedata);

    // bind array to texture reference with point sampling
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false; 

    cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );
}
Martin Beckett asked Nov 09 '11 21:11



1 Answer

There aren't any obvious bugs in your code, but there are several obvious performance opportunities:

1) for best performance, you should use texture to stage into shared memory - see the 'SobelFilter' SDK sample.
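A minimal sketch of that staging pattern, assuming the question's 16x16 block with a 1-pixel apron (the `TILE`/`APRON` names and the strided-load scheme are illustrative; the SobelFilter SDK sample shows the full version):

```cuda
// Sketch only: each block stages an 18x18 tile -- 16x16 pixels plus a
// 1-pixel apron -- from the texture into shared memory, so the
// neighbour fetches in the demosaic arithmetic become shared-memory
// reads instead of repeated texture fetches.
#define TILE  16
#define APRON 1

__shared__ uchar smem[TILE + 2 * APRON][TILE + 2 * APRON];

// Strided load: with 16x16 threads and an 18x18 tile, edge threads
// loop round to pick up the extra apron rows/columns.
for (int ty = threadIdx.y; ty < TILE + 2 * APRON; ty += blockDim.y)
    for (int tx = threadIdx.x; tx < TILE + 2 * APRON; tx += blockDim.x)
        smem[ty][tx] = tex2D(tex,
                             blockIdx.x * TILE + tx - APRON,
                             blockIdx.y * TILE + ty - APRON);
__syncthreads();

// The interpolation then reads smem[threadIdx.y + APRON][threadIdx.x + APRON]
// and its neighbours in place of tex2D(tex, x, y).
```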

2) As written, the code is writing bytes to global memory, which is guaranteed to incur a large performance hit. You can use shared memory to stage results before committing them to global memory.
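For example, the three separate byte-wide stores per pixel could be replaced by assembling the whole BGRA pixel in registers and committing it with a single 32-bit write (a sketch of the even-row/even-column branch of the question's kernel; `make_uchar4` is the standard CUDA vector constructor):

```cuda
// Sketch: build the pixel in registers, then issue one coalesced
// uchar4 (32-bit) store instead of three separate byte writes.
uchar b = (tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 2;  // B
uchar g =  tex2D(tex, x, y);                                  // G in R row
uchar r = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y)) / 2;  // R
d_output[i] = make_uchar4(b, g, r, 0xff);
```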

3) There is a surprisingly big performance advantage to sizing blocks in a way that matches the hardware's texture cache attributes. On Tesla-class hardware, the optimal block size for kernels using the same addressing scheme as yours is 16x4 (64 threads per block).

For workloads like this, it may be hard to compete with optimized CPU code. SSE2 can do 16 byte-sized operations in a single instruction, and CPUs are clocked about 5 times as fast.

ArchaeaSoftware answered Sep 22 '22 12:09
