speedup when using float4, opencl

Question

I have the following opencl kernel function to get the column sum of a image.

__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
                            int srcRows,int srcStep,int dstStep)   
{

    const int x = get_global_id(0);
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols)
    {
        int srcIdx = x ;
        int dstIdx = x ;

        float sum = 0;

        for (int y = 0; y < srcRows; ++y)
        {
            sum += src[srcIdx];
            dst[dstIdx] = sum;
            srcIdx += srcStep;
            dstIdx += dstStep;
        }
    }
}

I assign that each thread process a column here so that a lot of threads can get the column_sum of each column in parallel.

I also use float4 to rewrite the above kernel so that each thread can read 4 elements in a row at one time from the source image, which is shown below.

__kernel void columnSum(__global float* src,__global float* dst,int srcCols,
                            int srcRows,int srcStep,int dstStep)
{

    const int x = get_global_id(0);

    srcStep >>= 2;
    dstStep >>= 2;
    if (x < srcCols/4)
    {
        int srcIdx = x ;
        int dstIdx = x ;

        float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

        for (int y = 0; y < srcRows; ++y)
        {
            float4 temp2;
            temp2 = vload4(0, &src[4 * srcIdx]);
            sum = sum + temp2;

            vstore4(sum, 0, &dst[4 * dstIdx]);

            srcIdx += (srcStep/4);
            dstIdx += (dstStep/4);
        }
    }
}

In this case, theoretically, I think the time consumed by the second kernel to process a image should be 1/4 of the time consumed by the first kernel function. However, no matter how large the image is, the two kernels almost consume the same time. I don't know why. Can you guys give me some ideas? T

Paul Jurczak · Accepted Answer

OpenCL vector data types like float4 were fitting better the older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available for individual work-items, they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for OpenCL driver on NVIDIA Kepler GPU and Intel HD integrated graphics. So adding float4 vectors on modern GPU should require 4 operations. On the other hand, OpenCL driver on Intel Core CPU has CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so these vectors could be added in a single step.

speedup when using float4, opencl

Tags:

opencl

user2326258

1 Answers

Paul Jurczak

Recent Activity

Donate For Us

speedup when using float4, opencl

Tags:

opencl

user2326258

1 Answers

Paul Jurczak

Related questions

Recent Activity

Donate For Us