I have started learning OpenCL and I currently try to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation from randomly generated vertices and transformation matrices twice, once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on GPU (I'm testing on an Nvidia GTX 460).
I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:
inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
return (float4) (
dot(m.s048C, v),
dot(m.s159D, v),
dot(m.s26AE, v),
dot(m.s37BF, v)
);
}
kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
int gid = get_global_id(0);
int lid = get_local_id(0);
local float16 lBoneMats[NUM_BONES];
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;
float4 vertex = vertices[vidx];
float4 w = weights[vidx];
uint4 idx = indices[vidx];
resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
+ MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
+ MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
+ MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
}
}
Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.
Why is performance so bad with this should-be optimization?
If it helps, here is how I execute the kernel:
#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000
uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
File kernelFile("/home/alemariusnexus/test/skelanim.cl");
char opts[256];
sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);
cl_program prog = BuildOpenCLProgram(kernelFile, opts);
cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);
cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);
uint64_t s, e;
s = GetTickcount();
clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);
size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
size_t localWorkSize[] = { NUM_BONES };
for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);
e = GetTickcount();
return e-s;
}
I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.
Two things are affecting the performance in your exercise.
1) OpenCL
conforms to C99
std that does not contain anything about inline functions, i.e. the clcc compiler either just ignores the inline
keyword and does a regular call, or it supports the inlining silently. But it is not mandated to support that feature.
So, better define your MultiplyMatrixVector
as a pre-processor macro. Though this is not a major problem in your case.
2) You incorrectly threat the local memory (the LDM
).
Although its latency times less than the latency of the global memory
when it accessed properly, the local memory
is subject to bank conflicts.
Your vertex index is calculated with stride 100 per work item. The number of banks depends on the GPU in use but usually it is 16 or 32, i.e. you may access up to 16(32) four byte LDM
variables in one cycle without penalty if all of them are in different banks. Otherwise, you get a bank conflict
(when two or more threads accesses the same bank) that is serialized.
Your 100 threads in a work group accesses the array in LDM
with no special arrangement about bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus, you have a bank conflict in each row of MultiplyMatrixVector
function. The cummulative degree
that conflict at least 16x32 (here 16 is the number of the vector elements you access and 32 is a size of half wavefront or halfwarp).
The solution here is not to copy that array to LDM
, but to allocate it in the host with CL_MEM_READ_ONLY
(which you already did) and declare your kernel using __constant
specifier for boneMats
argument.
Then the OpenCL
library would allocate the memory in the constant area inside GPU
and the access to that array would be fast:
kernel void skelanim(__constant const float16* boneMats,
global const float4* vertices,
global const float4* weights,
global const uint4* indices,
global float4* resVertices)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With