Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retrieving Results from Kernel

Tags:

metal

I am experiencing some hiccups while playing around with the kernel function.

What I would like to do is simply send an array to the function and the get results back after waitUntilCompleted in an array.

The following is an array that will be filled with numbers from 0 to 123455 after malloc in a loop:

float *myVector = malloc(123456 * sizeof(float));

Here is the array, along with myVector, that will be sent to the kernel:

float *resultData =  malloc(123456 * sizeof(float));
id <MTLBuffer> inBuffer = [device newBufferWithBytes:&myVector[0] length:sizeof(myVector) options:MTLResourceOptionCPUCacheModeDefault];
id <MTLBuffer> buffer = [device newBufferWithBytes:&resultData[0] length:sizeof(resultData) options:MTLResourceOptionCPUCacheModeDefault];

Using compute command encoder, they are both set to be at index 0, 1 with offset 0, respectively.

The following sets up the sizes for thread group and threads within a group:

MTLSize threadGroupCounts = MTLSizeMake([device maxThreadsPerThreadgroup].width, 1, 1);
MTLSize threadGroups = MTLSizeMake((123456) / threadGroupCounts.width, 1, 1);

[commandEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupCounts];

[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];

I am getting twice the following error:

Execution of the command buffer was aborted due to an error during execution. Caused GPU Hang Error (IOAF code 3)

After having spent many hours on it, I came to the conclusion that the error is caused by the following lines:

MTLSize threadGroupCounts = MTLSizeMake([device maxThreadsPerThreadgroup].width, 1, 1);
MTLSize threadGroups = MTLSizeMake((123456) / [device maxThreadsPerThreadgroup].width, 1, 1);

If I set, just as an example, (123456) / [device maxThreadsPerThreadgroup].widthto be 32, the error would not occur but the results would be all zeroes except for the first 2 values within the array.

Here is how I try to get the results after processing:

NSData *data = [NSData dataWithBytesNoCopy:buffer.contents length:sizeof(myVector) freeWhenDone:NO];
float *finalArray = malloc(sizeof(float) * 123456);
[data getBytes:&finalArray[0] length:sizeof(finalArray)];

Here is the function:

kernel void test(const device float *inVector [[buffer (0)]],
                 device float *outVector [[buffer (1)]],
                 uint id [[thread_position_in_grid]])
{
    outVector[id] = -inVector[id]; 
}

I think I am having trouble with setting up the thread sizes. As a test, what I am trying to achieve is to set up the maximum threads per thread group allowed, dividing the size of the array by this number and send it off to process. Could someone show me how to set up thread group sizes, send an array to the function and eventually retrieve the results within an array correctly and properly?

Thanks.

like image 598
Unheilig Avatar asked Nov 08 '16 22:11

Unheilig


1 Answers

There's an error with how you're computing the size of your MTLBuffers. Because myVector is a pointer, sizeof(myVector) is probably 8, not 493824. That, in turn, causes you to not allocate enough space for your data, and to read beyond the bounds of the buffer in the kernel function. Try using the same size when creating the buffer as when allocating the float array with malloc, and see if that helps.

You'll need to make a corresponding change to how many bytes you retrieve from the output buffer with getBytes:length:.

I think the way you're computing your threadgroup size and count is reasonable, though you should be aware of integer truncation. If the total number of elements to be processed isn't evenly divisible by the threadgroup size, the way you're computing the threadgroup count will round down, causing you to skip some elements.

One way to avoid this is to round up the number of threadgroups you dispatch instead, and explicitly check against the buffer length to prevent out-of-bounds accesses. So you'd compute your threadgroup count and size like this:

const int elementCount = 123456;
MTLSize threadgroupSize = MTLSizeMake([device maxThreadsPerThreadgroup].width, 1, 1);
MTLSize threadgroups = MTLSizeMake(ceil(elementCount / (float)threadgroupSize.width), 1, 1);

...pass in the buffer size like this:

[computeCommandEncoder setBytes:&elementCount length:sizeof(elementCount) atIndex:2];

...and check against the bounds like this:

kernel void test(const device float *inVector [[buffer (0)]],
                 device float *outVector [[buffer (1)]],
                 constant int &elementCount [[buffer (2)]],
                 uint id [[thread_position_in_grid]])
{
    if (id < elementCount) {
        outVector[id] = -inVector[id];
    }
}
like image 74
warrenm Avatar answered Oct 11 '22 16:10

warrenm