OpenCL AMD vs NVIDIA performance

Tags: gpgpu, nvidia, opencl

I implemented a simple kernel that performs a kind of convolution and measured it on an NVIDIA GT 240. It took 70 ms when written in CUDA and 100 ms when written in OpenCL. OK, I thought, the NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong). I need to run it on AMD GPUs, so I migrated to the AMD APP SDK with exactly the same kernel code.

I ran two tests, and the results were discouraging: 200 ms on an HD 6670 and 70 ms on an HD 5850 (the same time as the GT 240 with CUDA). I would very much like to understand the reasons for this strange behaviour.

All projects were built in VS2010 using the settings from the NVIDIA and AMD sample projects, respectively.

Please do not take my post as an NVIDIA advertisement. I fully understand that the HD 5850 is more powerful than the GT 240. All I want to know is why there is such a difference and how to fix it.

Update. Below is the kernel code, which looks for 6 equally sized template images in the base image. Every pixel of the base image is treated as a possible origin of one of the templates and is processed by a separate thread. The kernel compares the R, G, B values of each pixel of the base image with those of the template, and if at least one difference exceeds the diff parameter, the pixel is counted as nonmatched. If the number of nonmatched pixels is less than maxNonmatchQt, the corresponding template is a hit.

__constant int tOffset = 8196; // one template size in memory (in bytes)
__kernel void matchImage6( __global unsigned char* image, // pointer to the base image
            int imgWidth, // base image width
            int imgHeight, // base image height
            int imgPitch, // base image pitch (in bytes)
            int imgBpp, // base image bytes (!) per pixel
            __constant unsigned char* templates, // pointer to the array of templates
            int tWidth, // templates width (the same for all)
            int tHeight, // templates height (the same for all)
            int tPitch, // templates pitch (in bytes, the same for all)
            int tBpp, // templates bytes (!) per pixel (the same for all)
            int diff, // max allowed difference of intensity
            int maxNonmatchQt, // max number of nonmatched pixels
            __global int* result // results: result[0] is the hit counter, followed by (template, x, y) triples
                            ) {
int x0 = (int)get_global_id(0);
int y0 = (int)get_global_id(1);
if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
    return;
int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
for( int y = 0; y < tHeight; y++) {
    int ind = y * tPitch;
    int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
    for( int x = 0; x < tWidth; x++) {
        unsigned char c0 = image[baseImgInd];
        unsigned char c1 = image[baseImgInd + 1];
        unsigned char c2 = image[baseImgInd + 2];
        for( int i = 0; i < 6; i++)
            if( abs( c0 - templates[i * tOffset + ind]) > diff || 
                            abs( c1 - templates[i * tOffset + ind + 1]) > diff || 
                            abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                nonmatchQt[i]++;
        ind += tBpp;
        baseImgInd += imgBpp;
    }
    if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt && nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
        return;
}
for( int i = 0; i < 6; i++)
    if( nonmatchQt[i] < maxNonmatchQt) {
        unsigned int pos = atom_inc( &result[0]) * 3;
        result[pos + 1] = i;
        result[pos + 2] = x0;
        result[pos + 3] = y0;
    }
}
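The last loop of the kernel implies a layout for the result buffer: result[0] is a hit counter (so the host has to zero it before launch), and hit k occupies the three ints that follow at result[1 + 3*k]. A minimal host-side sketch of decoding that buffer after reading it back is shown here. The names Match and decode_matches are hypothetical and not part of the original code; since the kernel never checks the counter against the buffer size, the sketch clamps the count to what actually fits.

```c
#include <assert.h>
#include <stddef.h>

/* One hit as written by the kernel: template index and origin in the base image. */
typedef struct {
    int template_index;
    int x;
    int y;
} Match;

/*
 * Decode the raw result buffer after it has been read back to the host.
 * result[0] is the hit counter; hit k occupies result[1 + 3*k .. 3 + 3*k].
 * result_len is the buffer length in ints; hits that would not fit in the
 * buffer or in out are dropped. Returns the number of hits stored in out.
 */
size_t decode_matches(const int *result, size_t result_len,
                      Match *out, size_t out_cap)
{
    assert(result != NULL && result_len >= 1);
    size_t count = result[0] > 0 ? (size_t)result[0] : 0;
    size_t fit = (result_len - 1) / 3;   /* complete triples the buffer can hold */
    if (count > fit) count = fit;
    if (count > out_cap) count = out_cap;
    for (size_t k = 0; k < count; k++) {
        const int *hit = result + 1 + 3 * k;
        out[k].template_index = hit[0];
        out[k].x = hit[1];
        out[k].y = hit[2];
    }
    return count;
}
```

If the counter read back is larger than the number of triples the buffer can hold, the kernel has written past the end of the buffer, which is worth checking before trusting any timing numbers.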

Kernel run configuration: global work size = (1900, 1200); local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
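One detail in this configuration: in OpenCL 1.x each global work size must be a multiple of the corresponding local work size, and 1900 is not a multiple of 32. The usual approach is to round the global size up on the host and let the kernel's bounds check discard the extra work-items, which this kernel already does. A small sketch of that rounding follows; the helper name round_up is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of a local work size. */
size_t round_up(size_t size, size_t multiple)
{
    assert(multiple > 0);
    size_t rem = size % multiple;
    return rem == 0 ? size : size + (multiple - rem);
}
```

With the sizes quoted above, round_up(1900, 32) gives 1920, while 1200 is already a multiple of both 8 and 16 and stays unchanged.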

Execution time: HD 5850 - 69 ms, HD 6670 - 200 ms, GT 240 - 100 ms.

Any remarks about my code are also highly appreciated.


asked Jan 23 '12 11:01

AdelNick

1 Answer

The difference in execution times is caused by the compilers. Your code can easily be vectorized. Treat the pixels of image and templates as uchar4 vectors (4 bytes per pixel, with the fourth component always 0). Instead of 3 memory reads:

unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];

use only one (vload4 reads 4 consecutive bytes starting at the given byte index):

uchar4 c = vload4(0, image + baseImgInd);

Instead of the bulky if:

    if( abs( c0 - templates[i * tOffset + ind]) > diff || 
               abs( c1 - templates[i * tOffset + ind + 1]) > diff || 
               abs( c2 - templates[i * tOffset + ind + 2]) > diff)
         nonmatchQt[i]++;

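use the faster form (diff must fit in 0..255 here, because it is compared as a uchar):

    uchar4 t = vload4(0, templates + i * tOffset + ind);
    nonmatchQt[i] += any(abs_diff(c, t) > (uchar4)((uchar)diff));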

This can speed up your code by up to 3 times (if the compiler doesn't vectorize the code by itself). I suppose that the AMD OpenCL compiler does not perform such vectorization and other optimizations. In my experience, OpenCL on NVIDIA GPUs can usually be made faster than CUDA, because it is more low-level.
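The packed test any(abs_diff(c, t) > diff) matches the original three-way if only under two assumptions: the fourth byte is equal (for example 0) in both the base image and the templates, and diff fits in 0..255. The plain C model below is a sketch for checking that equivalence on the CPU before comparing timings; the function names mismatch_scalar and mismatch_packed are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Original test: per-channel comparison over R, G, B using int arithmetic. */
int mismatch_scalar(const unsigned char *p, const unsigned char *t, int diff)
{
    return abs(p[0] - t[0]) > diff ||
           abs(p[1] - t[1]) > diff ||
           abs(p[2] - t[2]) > diff;
}

/* Model of any(abs_diff(c, t) > diff) on 4-byte pixels: all 4 bytes are compared. */
int mismatch_packed(const unsigned char *p, const unsigned char *t, int diff)
{
    assert(diff >= 0 && diff <= 255);
    int any = 0;
    for (int k = 0; k < 4; k++) {
        /* Unsigned absolute difference, always in 0..255. */
        unsigned char d = p[k] > t[k] ? p[k] - t[k] : t[k] - p[k];
        any |= d > (unsigned char)diff;
    }
    return any;
}
```

The two functions agree whenever the padding byte is identical in both pixels. If the padding differs by more than diff, mismatch_packed reports a mismatch that mismatch_scalar does not, so the padding has to be cleared when the images are uploaded.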


answered Nov 09 '22 16:11

gudasergey
