
Accessing GPU buffer by index

Note: My question regards Apple's Metal API, but I think the concept is general enough to translate to other GPU frameworks as well.

My goal: add a 1 x N row vector b to each row in an M x N matrix A.

My kernel, reduced to the part I'm having trouble with:

kernel void vmadd(const device float* A [[ buffer(0) ]],
                  const device float* b [[ buffer(1) ]],
                  device float* C [[ buffer(2) ]],
                  constant ushort& aWidth [[ buffer(3) ]],
                  ushort2 gid [[ thread_position_in_grid ]]) {

    int idx = gid.y * aWidth + gid.x; // Compute absolute index in C
    C[idx] = A[idx] + b[gid.x];

}
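
For completeness, the host side binds the buffers to match the [[ buffer(n) ]] indices, roughly like this (simplified; the encoder and buffer variable names are illustrative):

// Sketch of the encoder setup; aBuffer, bBuffer, cBuffer are MTLBuffers
// created elsewhere via device.makeBuffer
encoder.setComputePipelineState(computePipeline)
encoder.setBuffer(aBuffer, offset: 0, index: 0) // A
encoder.setBuffer(bBuffer, offset: 0, index: 1) // b
encoder.setBuffer(cBuffer, offset: 0, index: 2) // C
var aWidth: UInt16 = 8 // matches the kernel's constant ushort& parameter
encoder.setBytes(&aWidth, length: MemoryLayout<UInt16>.size, index: 3)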

Assumption: My understanding is that gid is the position of a single element in C: gid.x is the column, gid.y is the row. If this is not true, someone please correct me.

Now if I fill A with 8 x 8 zeros:

A = 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0

and b like so:

b = 1 2 3 4 5 6 7 8

then after execution, C should be an 8 x 8 matrix where each row is 1 2 3 4 5 6 7 8.

Instead, I get this:

C = 1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    0 0 0 0 0 0 0 0
    1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    0 0 0 0 0 0 0 0

I'm sizing threadgroups according to Apple's recommendation:

let w = computePipeline.threadExecutionWidth
let h = computePipeline.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSizeMake(w, h, 1)
let threadgroupsPerGrid = MTLSize(width: (cWidth + w - 1) / w,
                                  height: (cHeight + h - 1) / h,
                                  depth: 1)

Which on my machine produces (64, 16, 1) threads per group and (1, 1, 1) threadgroups per grid.
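
The dispatch itself is the standard two-argument call (sketch; the compute encoder variable is illustrative):

// Dispatch using the sizes computed above
encoder.dispatchThreadgroups(threadgroupsPerGrid,
                             threadsPerThreadgroup: threadsPerThreadgroup)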

However, if I manually set threadsPerThreadgroup to (1, 1, 1) and threadgroupsPerGrid to (8, 8, 1), I get the correct result in C.

The question:

I'm fairly certain that my problem has to do with threadgroup sizing and the way buffers are managed, but I'm new to GPU programming and don't understand it fully.

Why does decreasing the size of the threadgroups produce the correct result? Even more generally, why does threadgroup sizing have any effect on this computation at all?

It seems to me that if gid always corresponds to an index in C, and I ask for b[gid.x], then that value should be available. I understand that data is generally broken up in order to fit into threadgroup caches - if that's the case here, what rules does it follow and how can I account for it?

Asked Feb 09 '26 by Hundley

1 Answer

My understanding is that gid is the position of a single element in C: gid.x is the column, gid.y is the row. If this is not true, someone please correct me.

This is not quite true. gid is the position within the grid.

Because the grid is, as it happens, 64x16, the compute function will be called for positions which fall outside of your 8x8 matrices (A and C) and your 8-element vector (b). When that happens, reads of A may access the wrong row or even beyond the end of A. Similarly, reads of b will read beyond its end.

For example, consider when gid is (8, 0). idx will be 8. You'll read A[8], which is actually at (0, 1). You'll read b[8], which is past the end. That's technically undefined, but in practice quite likely to be 0 for a buffer of that relatively short length. You'll write to C[8], which is also at (0, 1). This happens at roughly the same time as the function invocation that is supposed to write at (0, 1), and there's a race as to which value prevails.
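
To make that concrete, here is where a few of the extra invocations land (this follows directly from idx = gid.y * aWidth + gid.x with aWidth = 8):

gid = (8, 0)  -> idx = 8  -> writes C at (0, 1) using b[8], read past the end of b
gid = (15, 0) -> idx = 15 -> writes C at (7, 1) using b[15], read past the end of b
gid = (0, 8)  -> idx = 64 -> writes past the end of C entirely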

Your function should, near the beginning, test if gid is out of bounds and, if so, return early:

if (any(gid >= aWidth))
    return;

(This assumes that A and C will always be square, so both coordinates can be checked against the single width value. If they might not be square, you'd pass the height in as another constant, say aHeight, and test each component separately: if (gid.x >= aWidth || gid.y >= aHeight) return;.)

You can try to adjust the computations of threadsPerThreadgroup and threadgroupsPerGrid to get the grid to be exactly the size of your matrices, but that can be tedious to do correctly for all cases. That said, you can certainly keep threadsPerThreadgroup from being too big:

let w = min(computePipeline.threadExecutionWidth, cWidth)
let h = min(computePipeline.maxTotalThreadsPerThreadgroup / w, cHeight)

But you'll still need the check in the compute function because the total grid can still be too large. For example, suppose computePipeline.threadExecutionWidth is at least 8 and computePipeline.maxTotalThreadsPerThreadgroup is 60. Well, w will be 8 but h will be 7. Then, threadgroupsPerGrid will be (1, 2, 1) and the total grid size will be 8x14x1, which is again larger than your matrices.
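
Alternatively, if the device supports nonuniform threadgroup sizes (a feature of newer GPU families, exposed since macOS 10.13 / iOS 11), you can have Metal clip the grid to exactly the size you need by calling dispatchThreads instead of dispatchThreadgroups. A sketch, assuming the same encoder and sizes as above:

// Grid sized exactly to the output; Metal forms partial threadgroups
// at the edges on hardware that supports nonuniform threadgroup sizes.
let threadsPerGrid = MTLSizeMake(cWidth, cHeight, 1)
encoder.dispatchThreads(threadsPerGrid,
                        threadsPerThreadgroup: threadsPerThreadgroup)

With that dispatch, no out-of-bounds invocations are created, so the early-return guard becomes cheap insurance rather than a necessity.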

Answered Feb 12 '26 by Ken Thomases