I'm developing an OpenCL 1.2 application that deals with large imagery. At the moment, the image I'm testing with is 16507x21244 pixels. My kernel is run in a loop that operates on chunks of the image. The kernel takes 32 bpp (RGBA) chunks of the image as input and produces float4-pixel chunks as output.
Let's define one side of a (square) chunk, in pixels, to be the chunk size. That is to say, an 8192x8192 pixel square has chunk size 8192. Of course, on the right and bottom edges we have smaller rectangular chunks if the image is not cleanly divisible by the chunk size. My code handles this, but for the rest of this post, let's ignore it for simplicity.
I am trying to determine the maximum chunk size that I can operate on in each iteration of my loop, as well as the optimal chunk size (which may not be the maximum chunk size).
For reference, here is the information reported by the clinfo utility on my machine. I am running my kernel on a GeForce GTX 560 Ti with the Nvidia platform, using their proprietary Linux drivers.
My initial naïve assumption was that I could operate on the max 2D image size. However, this results in clEnqueueNDRangeKernel returning an error code of -4 (CL_MEM_OBJECT_ALLOCATION_FAILURE).
Thinking about it, this makes sense to me. With 1 GiB of video memory, one would expect to be able to hold a single 16384x16384 pixel texture (32bpp), or an 8192x8192 pixel texture (float4). If both need to be cached on the card while the kernel runs, we could expect to use the following amount of memory:
4 bytes-per-pixel * chunk size^2 (input image)
+ 16 bytes-per-pixel * chunk size^2 (output image)
= 1 GiB total video memory
Solving for chunk size, we get:
chunk size = sqrt(1 GiB / 20)
Plugging in the amount of memory reported by OpenCL (which is slightly less than 1 GiB, at about 1023 MiB) and flooring the result, we get:
floor(sqrt(1072889856/20)) = 7324
However, a chunk size of 7324 still results in CL_MEM_OBJECT_ALLOCATION_FAILURE.
My next guess was that we can't pass an image larger than the maximum allocation size, which OpenCL reports as 268222464 bytes for my card. Because my output image has the larger per-pixel size (16 bytes versus 4), it dictates my chunk size.
floor(sqrt(268222464/16)) = 4094
Hey, that actually works! Now what if we try to go larger? To my surprise, it doesn't fail. Through trial and error, I narrowed in on 6784 as the actual maximum chunk size; at 6785 it begins complaining with CL_MEM_OBJECT_ALLOCATION_FAILURE. I do not know why the maximum appears to be 6784, and I do not know whether this is repeatable or whether the value fluctuates (for example, with other state in video memory affecting how much it can hold). I also find that running with a chunk size of 6784 is a few seconds slower than running with the size based on the max allocation. I wonder if this is because OpenCL needs to perform multiple (expensive) allocations under the hood? I also noticed the "max size of kernel argument" that OpenCL reports (CL_DEVICE_MAX_PARAMETER_SIZE). However, that value seems bogus: if I could only pass 4096 bytes in, that would limit me to 16x16 pixels!
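For reference, here is a small sketch of the arithmetic above, pulling the two limits from clGetDeviceInfo (the helper is only illustrative, not my actual code):

    #include <math.h>
    #include <stdio.h>
    #include <CL/cl.h>

    /* Illustrative helper: "device" is an already-selected cl_device_id. */
    static void print_candidate_chunk_sizes(cl_device_id device)
    {
        cl_ulong global_mem = 0, max_alloc = 0;

        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(global_mem), &global_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                        sizeof(max_alloc), &max_alloc, NULL);

        /* 4 B/pixel (RGBA input) + 16 B/pixel (float4 output) = 20 B/pixel. */
        size_t by_total_mem = (size_t)floor(sqrt((double)global_mem / 20.0));

        /* The float4 output is the largest single allocation at 16 B/pixel. */
        size_t by_max_alloc = (size_t)floor(sqrt((double)max_alloc / 16.0));

        printf("chunk size bound from total memory:   %zu\n", by_total_mem);
        printf("chunk size bound from max allocation: %zu\n", by_max_alloc);
    }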
So I'm left with two fundamental questions:
1. How do I determine the absolute maximum chunk size I can use for a single kernel invocation?
2. How do I determine the fastest chunk size (which may not be the maximum)?
As a bonus question, are there any good resources I could turn to for future questions of this nature regarding low-level OpenCL-hardware interactions?
And finally, I will provide some code snippets for peer-review; I would be extremely grateful for any constructive criticism!
As always, thanks in advance for any help!
To answer your direct questions:
1) To determine the absolute maximum chunk size you can use for a single kernel operation, you first need to know which memory 'chunk size' refers to. OpenCL defines five memory regions. One of these is host memory, which we will ignore here. The other four are global, constant, local, and private.
To find out what your hardware can support, I strongly suggest consulting the Khronos API docs (linked at the bottom). There is a slew of metadata about your device that you can collect via clGetDeviceInfo. For instance, there are queries for the maximum 2D and 3D image width and height the device supports. I would also suggest looking at CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_MAX_COMPUTE_UNITS when sizing your work-groups. There is even a CL_DEVICE_MAX_MEM_ALLOC_SIZE query, as shown in the sketch below.
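For example, a minimal clGetDeviceInfo sketch for those queries (error handling omitted; the helper name is just for illustration):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Sketch: query the limits mentioned above for an already-selected device. */
    static void print_device_limits(cl_device_id device)
    {
        size_t   img_w = 0, img_h = 0;
        cl_ulong local_mem = 0, max_alloc = 0;
        cl_uint  compute_units = 0;

        clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH,  sizeof(img_w), &img_w, NULL);
        clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT, sizeof(img_h), &img_h, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,     sizeof(local_mem), &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,  sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);

        printf("max 2D image:     %zu x %zu\n", img_w, img_h);
        printf("local memory:     %llu bytes\n", (unsigned long long)local_mem);
        printf("compute units:    %u\n", compute_units);
        printf("max single alloc: %llu bytes\n", (unsigned long long)max_alloc);
    }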
Regarding your performance concern: the memory sizes reported to you are the largest sizes that can be handled optimally for a work-group or work-item (depending on the query). What may be happening when you exceed them is that memory spills over into the global space, which requires more memory allocations across different workers and causes a performance decrease. I'm not 100% certain about that, but it may well be part of the issue when you exceed the recommended buffer size.
2) To determine the fastest chunk size, trial and error is not needed. The book "OpenCL Programming Guide" (Addison-Wesley) has a section on using events for profiling in the host application. Only certain functions can be profiled this way: essentially the clEnqueue* commands (kernel launches and buffer/image reads, writes, and copies), each of which can return an event.
To enable this profiling, the CL_QUEUE_PROFILING_ENABLE flag needs to be set when one creates a queue. Then clGetEventProfilingInfo(cl_event event, cl_profiling_info param_name, size_t param_value_size, void* param_value, size_t* param_value_size_ret) can be used to extract the timing data, and the host application can do with that data as it pleases.
By using this profiling information you can determine the fastest 'chunk size', either empirically through your software or analytically, and then use that value as a constant across the board.
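For example, a minimal timing sketch along those lines (it assumes the context, device, kernel, and work sizes are already set up elsewhere; error handling omitted):

    #include <CL/cl.h>

    /* Sketch: time one kernel launch via event profiling.  The context,
     * device, kernel, and work sizes are assumed to be set up elsewhere. */
    static double time_kernel_ms(cl_context context, cl_device_id device,
                                 cl_kernel kernel,
                                 const size_t global_size[2],
                                 const size_t local_size[2])
    {
        /* The queue must be created with profiling enabled. */
        cl_command_queue queue =
            clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);

        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                               global_size, local_size, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        /* Timestamps are reported in nanoseconds. */
        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);

        clReleaseEvent(evt);
        clReleaseCommandQueue(queue);
        return (end - start) * 1e-6;
    }

Calling something like this once per candidate chunk size gives directly comparable per-launch timings.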
Bonus question: A good resource is the "OpenCL Programming Guide" (Addison-Wesley) by Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, and Dan Ginsburg. The Khronos docs also contain a lot of information.
As a side note: you are running this kernel inside a doubly nested loop in the host code, which rather defeats the purpose of parallel programming, especially on an image. I would suggest refactoring your code and researching parallel programming models for GPU operations (a rough sketch follows below). Also do some research on setting up and using memory barriers in OpenCL; Intel and Nvidia have some great papers and examples on this. Finally, the Khronos API docs are always available.
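As a purely illustrative sketch of that kind of refactor (this is not your code, just an example), a single 2D NDRange can cover one whole chunk so the device parallelizes over pixels rather than the host iterating over them:

    #include <CL/cl.h>

    /* Purely illustrative: enqueue one 2D NDRange covering a whole chunk,
     * letting the device parallelize over pixels instead of host loops.
     * The kernel should bounds-check its global IDs against chunk_w/chunk_h
     * because the global size is rounded up to a multiple of the local size. */
    static cl_int enqueue_chunk(cl_command_queue queue, cl_kernel kernel,
                                size_t chunk_w, size_t chunk_h)
    {
        const size_t local[2]  = { 16, 16 };   /* example work-group shape */
        const size_t global[2] = {
            ((chunk_w + local[0] - 1) / local[0]) * local[0],
            ((chunk_h + local[1] - 1) / local[1]) * local[1]
        };

        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global, local, 0, NULL, NULL);
    }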