
OpenCL scalar vs vector

Tags: gpgpu, gpu, opencl

I have a simple kernel:

__kernel void vecadd(__global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}

Why does the kernel run more than 30% slower when I change float to float4?

All the tutorials say that using vector types speeds up computation...

On the host side, the memory allocated for the float4 arguments is 16-byte aligned, and the global_work_size passed to clEnqueueNDRangeKernel is 4 times smaller.
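
To make the work-size relationship concrete, here is a minimal sketch in plain C. The helper name work_items_for is hypothetical (it is not part of OpenCL); it only computes how many work-items are needed when each one processes a given number of floats.

```c
#include <stddef.h>

/* Hypothetical helper, not an OpenCL call.
   Returns the number of work-items needed when each one handles
   `width` floats (1 for float, 4 for float4). The division rounds up
   so a tail that does not fill a whole vector is still covered. */
static size_t work_items_for(size_t n_floats, size_t width)
{
    return (n_floats + width - 1) / width;
}
```

For the 1024*1024 elements mentioned in this question, work_items_for(1024 * 1024, 1) is 1048576 and work_items_for(1024 * 1024, 4) is 262144. Both are multiples of the local_work_size of 256.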

The kernel runs on an AMD HD5770 GPU with AMD-APP-SDK-v2.6.

Querying the device for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.

EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time is measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
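
One way to compare the two versions fairly is to turn the profiling timestamps into an effective bandwidth. The sketch below is plain C and assumes the start and end values were already read with clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE. The function names are hypothetical, and the traffic model (two loads and one store of a 4-byte float per element, no caching) is an assumption.

```c
#include <stdint.h>

/* Elapsed kernel time in milliseconds. OpenCL reports
   CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END
   as device timestamps in nanoseconds. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) * 1e-6;
}

/* Bytes moved by vecadd under the assumed model:
   read A, read B, write C, 4 bytes each, per element. */
static uint64_t vecadd_bytes(uint64_t n_floats)
{
    return n_floats * 3u * 4u;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes).
   Bytes per nanosecond is numerically the same as GB per second. */
static double effective_gbps(uint64_t bytes_moved,
                             uint64_t start_ns, uint64_t end_ns)
{
    return (double)bytes_moved / (double)(end_ns - start_ns);
}
```

The float and float4 kernels move the same number of bytes for the same element count, so any difference in effective_gbps comes from the elapsed time alone.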

For a smaller global_work_size (8196 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.

ldanko asked Jan 19 '12 21:01


2 Answers

I don't know which tutorials you are referring to, but they must be old. Both ATI and NVIDIA have used scalar GPU architectures for at least half a decade now. Nowadays, using vectors in your code is only a syntactic convenience; it brings no performance benefit over plain scalar code. It turns out that a scalar architecture suits GPUs better than a vector one because it makes better use of the hardware resources.

lucho answered Sep 23 '22 14:09


Without knowing more about the workgroup and global sizes, I am not sure why the vectors would be that much slower for you. I would expect at least the same performance.

If it suits your kernel, can you start with C already holding the values of A? The kernel would then touch two buffers instead of three, cutting the memory it works on by 33%. Maybe this applies to your situation?

__kernel void vecadd(__global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    C[idx] += B[idx];
}
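
As a rough check on what the in-place variant saves, the plain C sketch below counts buffers and per-element accesses for both forms. The struct and names are hypothetical, and the model assumes 4-byte floats with every load and store going to global memory.

```c
#include <stdint.h>

/* Hypothetical cost model for one kernel variant. */
typedef struct {
    unsigned buffers;  /* distinct global buffers the kernel touches */
    unsigned loads;    /* global loads per element */
    unsigned stores;   /* global stores per element */
} kernel_cost;

/* C = A + B: three buffers, read A and B, write C. */
static const kernel_cost out_of_place = { 3u, 2u, 1u };

/* C += B: two buffers, read C and B, write C. */
static const kernel_cost in_place = { 2u, 2u, 1u };

/* Device memory allocated for n floats per buffer, in bytes. */
static uint64_t footprint_bytes(kernel_cost k, uint64_t n_floats)
{
    return n_floats * k.buffers * 4u;
}
```

Under this model the in-place form allocates a third less memory (8388608 instead of 12582912 bytes for 1024*1024 floats), while each element still costs two loads and one store, so it is worth measuring before expecting a speedup.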

Also, have you tried reading the values into a private vector and then adding? Or maybe try both strategies.

__kernel void vecadd(__global const float4 *A,
                     __global const float4 *B,
                     __global float4 *C)
{
    int idx = get_global_id(0);
    float4 tmp = A[idx] + B[idx];
    C[idx] = tmp;
}

mfa answered Sep 25 '22 14:09