Proper way to inform OpenCL kernels of many memory objects?

Tags: gpgpu, opencl

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel needs to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?

The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, and only deallocate the buffers at application end. Their contents, however, may change as the kernels read from and write to them.

In CUDA, the way I did this was to create 60+ program-scope global variables in my CUDA code. On the host, I would then write the addresses of the device buffers I allocated into these global variables. The kernels would simply use these global variables to find the buffers they needed to work with.

What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best alternative?

asked Jun 16 '12 11:06

int3h

1 Answer

60 global buffers sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimal unit of work, not something colossal!

However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:

A is 100 elements
B is 200 elements
C is 100 elements

big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]

Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:

A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
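The layout above can be sketched in plain C as a running sum over the array lengths. This is only an illustration of the offset bookkeeping on the host: the helper names build_offsets and packed_at are hypothetical, and the element type float is an assumption, since the question does not say what the buffers hold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill offsets[i] with the start index of array i
   inside the packed buffer. Returns the total number of elements the
   packed buffer must hold. */
size_t build_offsets(const size_t *lengths, size_t count, size_t *offsets)
{
    assert(lengths != NULL && offsets != NULL);
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        offsets[i] = total;   /* array i starts where the previous one ended */
        total += lengths[i];
    }
    return total;
}

/* Hypothetical helper: address element `index` of logical array `which`
   inside big_array. No bounds checking is done here. */
float *packed_at(float *big_array, const size_t *offsets,
                 size_t which, size_t index)
{
    return &big_array[offsets[which] + index];
}
```

With lengths 100, 200 and 100, build_offsets produces the same offsets as in the example, and packed_at(big_array, offsets, 1, 20) addresses B[20].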

I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
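A rough sketch of the "extract each offset at the start of the kernel" idea, written as a plain C function so it can be compiled and run on the host. It is a stand-in, not an actual kernel: in OpenCL C the function would be marked __kernel, big_array and the derived pointers would be __global, and gid would come from get_global_id(0). The function name and the element-wise add are assumptions made purely for illustration.

```c
#include <stddef.h>

/* Plain-C stand-in for a kernel body (hypothetical name).
   offsets = {start of A, start of B, start of C}, in elements. */
void add_a_b_into_c(float *big_array, const unsigned *offsets, size_t gid)
{
    /* Derive one pointer per logical array once, at the top. */
    float *A = big_array + offsets[0];
    float *B = big_array + offsets[1];
    float *C = big_array + offsets[2];

    /* From here on the arrays are indexed normally. */
    C[gid] = A[gid] + B[gid];
}
```

After the three pointer assignments, the rest of the body no longer needs to mention offsets at all, which removes most of the awkward indexing.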

On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer (http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html), which would also allow you to pass references to specific arrays without the offsets array.
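One detail to watch with sub-buffers: clCreateSubBuffer takes a region with an origin and a size in bytes, and the origin has to be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN (which clGetDeviceInfo reports in bits), so a tightly packed layout may not be usable as-is. The sketch below only plans padded origins in plain C and makes no OpenCL calls. The struct region and the function names are hypothetical (the struct mirrors the origin and size fields of cl_buffer_region), and the 128-byte alignment used in the note afterwards is an assumed value, not something to hard-code.

```c
#include <stddef.h>

/* Hypothetical type mirroring the byte fields of cl_buffer_region. */
struct region {
    size_t origin;  /* byte offset into the parent buffer */
    size_t size;    /* byte length of the sub-region */
};

/* Round n up to the next multiple of align (align must be > 0). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Lay out `count` arrays back to back, padding so that every origin is a
   multiple of align_bytes. Returns the byte size the parent buffer needs. */
size_t plan_regions(const size_t *byte_sizes, size_t count,
                    size_t align_bytes, struct region *out)
{
    size_t cursor = 0;
    for (size_t i = 0; i < count; ++i) {
        cursor = round_up(cursor, align_bytes);
        out[i].origin = cursor;
        out[i].size = byte_sizes[i];
        cursor += byte_sizes[i];
    }
    return cursor;
}
```

For three float arrays of 100, 200 and 100 elements (400, 800 and 400 bytes) and an assumed 128-byte alignment, the planned origins are 0, 512 and 1408 bytes, and the parent buffer needs 1808 bytes.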

I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg, it might be faster. It will certainly reduce the length of your argument list.

answered Jan 03 '23 01:01

Ryan Marcus
