Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Proper way to inform OpenCL kernels of many memory objects?

Tags:

gpgpu

opencl

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel is going to need to be able to access. What's the recommended way to for letting each kernel know the location of each of these buffers?

The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.

In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.

What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different than CUDA's, but I can't find a clear answer on if my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that wont work, what's the best way otherwise?

like image 204
int3h Avatar asked Jun 16 '12 11:06

int3h


People also ask

What is kernel in OpenCL?

A kernel is essentially a function written in the OpenCL language that enables it to be compiled for execution on any device that supports OpenCL. The kernel is the only way the host can call a function that will run on a device. When the host invokes a kernel, many work items start running on the device.

What is work group OpenCL?

A "work group" is a 1, 2 or 3 dimensional set of threads within the thread hierarchy and contains a set of "work items," and each of these work items maps to a "core" in a GPU. When using SYCL with an OpenCL device, the "work group" size often dictates the occupancy of the compute units.

What is a host program in OpenCL?

In the OpenCL paradigm, a "host program" is the outer control logic that performs the configuration for a GPU-based application. This host program normally would run on a general purpose CPU (such as the x86-compatible main processor in most desktop PCs).


1 Answers

60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!

However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:

A is 100 elements
B is 200 elements
C is 100 elements

big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]

Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:

A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]

I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.

On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.

I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArgs, it might be faster. It will certainly reduce the length of your argument list.

like image 199
Ryan Marcus Avatar answered Jan 03 '23 01:01

Ryan Marcus