OpenCL Performance Optimization

Question

I have started learning OpenCL and I currently try to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation from randomly generated vertices and transformation matrices twice, once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on GPU (I'm testing on an Nvidia GTX 460).

I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:

inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
    return (float4) (
        dot(m.s048C, v),
        dot(m.s159D, v),
        dot(m.s26AE, v),
        dot(m.s37BF, v)
    );
}


kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    local float16 lBoneMats[NUM_BONES];
    async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
        int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;

        float4 vertex = vertices[vidx];
        float4 w = weights[vidx];
        uint4 idx = indices[vidx];

        resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
                + MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
                + MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
                + MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
    }
}

Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.

Why is performance so bad with this should-be optimization?

If it helps, here is how I execute the kernel:

#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000

uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
    File kernelFile("/home/alemariusnexus/test/skelanim.cl");

    char opts[256];
    sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);

    cl_program prog = BuildOpenCLProgram(kernelFile, opts);

    cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);

    cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
    cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
    cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
    cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
    cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);

    uint64_t s, e;
    s = GetTickcount();

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);

    size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
    size_t localWorkSize[] = { NUM_BONES };

    for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
        clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    }

    clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);

    e = GetTickcount();

    return e-s;
}

I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.

Serge · Accepted Answer

Two things are affecting the performance in your exercise.

1) OpenCL conforms to C99 std that does not contain anything about inline functions, i.e. the clcc compiler either just ignores the inline keyword and does a regular call, or it supports the inlining silently. But it is not mandated to support that feature.

So, better define your MultiplyMatrixVector as a pre-processor macro. Though this is not a major problem in your case.

2) You incorrectly threat the local memory (the LDM).

Although its latency times less than the latency of the global memory when it accessed properly, the local memory is subject to bank conflicts.

Your vertex index is calculated with stride 100 per work item. The number of banks depends on the GPU in use but usually it is 16 or 32, i.e. you may access up to 16(32) four byte LDM variables in one cycle without penalty if all of them are in different banks. Otherwise, you get a bank conflict (when two or more threads accesses the same bank) that is serialized. Your 100 threads in a work group accesses the array in LDM with no special arrangement about bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus, you have a bank conflict in each row of MultiplyMatrixVector function. The cummulative degree that conflict at least 16x32 (here 16 is the number of the vector elements you access and 32 is a size of half wavefront or halfwarp).

The solution here is not to copy that array to LDM, but to allocate it in the host with CL_MEM_READ_ONLY (which you already did) and declare your kernel using __constant specifier for boneMats argument. Then the OpenCL library would allocate the memory in the constant area inside GPU and the access to that array would be fast:

kernel void skelanim(__constant const float16* boneMats, 
                     global const float4* vertices, 
                     global const float4* weights, 
                     global const uint4* indices, 
                     global float4* resVertices)

OpenCL Performance Optimization

Tags:

optimization

gpgpu

opencl

Alemarius Nexus

1 Answers

Serge

Recent Activity

Donate For Us

OpenCL Performance Optimization

Tags:

optimization

gpgpu

opencl

Alemarius Nexus

1 Answers

Serge

Related questions

Recent Activity

Donate For Us