I need some clarification. I'm developing OpenCL on my laptop, which has a small NVIDIA GPU (310M). When I query the device for CL_DEVICE_MAX_COMPUTE_UNITS, the result is 2. I read that the number of work groups used to run a kernel should correspond to the number of compute units (Heterogeneous Computing with OpenCL, Chapter 9, p. 186); otherwise it would waste too much global memory bandwidth.
Also, the chip is specified to have 16 CUDA cores (which correspond to PEs, I believe). Does that mean that, theoretically, the most performant setup for this GPU with regard to global memory bandwidth is two work groups with 16 work items each?
The global work size is frequently related to the problem size. The local work-group size is selected to maximize compute-unit throughput and to accommodate the number of work items that need to share local memory. Let's consider a couple of examples: A) scale an image from N by M to X by Y.
A "work group" is a 1-, 2-, or 3-dimensional set of "work items" within the work-item hierarchy; the work items of a group are scheduled onto the processing elements ("cores") of a single compute unit. When using SYCL with an OpenCL device, the work-group size often dictates the occupancy of the compute units.
A kernel is essentially a function, written in the OpenCL C language, that can be compiled for execution on any device that supports OpenCL. The kernel is the only way the host can invoke code that runs on a device. When the host enqueues a kernel, many work items start running on the device.
While setting the number of work groups equal to CL_DEVICE_MAX_COMPUTE_UNITS might be sound advice on some hardware, it certainly is not on NVIDIA GPUs.
On the CUDA architecture, an OpenCL compute unit is the equivalent of a multiprocessor (which can have either 8, 32 or 48 cores at the time of writing), and these are designed to be able to simultaneously run up to 8 work groups (blocks in CUDA) each. At larger input data sizes, you might choose to run thousands of work groups, and your particular GPU can handle up to 65535 x 65535 work groups per kernel launch.
OpenCL has another query, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (a per-kernel property obtained through clGetKernelWorkGroupInfo rather than a device attribute). If you query it for a kernel on an NVIDIA device, it will return 32 (the "warp", the natural SIMD width of the hardware). Your work-group size should be a multiple of that value; work groups can contain up to 512 work items each, depending on the resources consumed by each work item. The standard rule of thumb for your particular GPU is that you need at least 192 active work items per compute unit (threads per multiprocessor, in CUDA terms) to cover the latency of the architecture and potentially obtain either full memory bandwidth or full arithmetic throughput, depending on the nature of your code.
NVIDIA ship a good document called "OpenCL Programming Guide for the CUDA Architecture" with the CUDA Toolkit. You should take some time to read it, because it contains the specifics of how NVIDIA's OpenCL implementation maps onto the features of their hardware, and it will answer the questions you have raised here.
I don't think matching your work-group count to the number of compute units is a good idea even on a CPU. It is better to oversubscribe the cores severalfold: this allows the workload to move around dynamically (in work-group quanta) as processors come online or get distracted with other work. Workgroup count = CL_DEVICE_MAX_COMPUTE_UNITS only really works well on a machine that is doing absolutely nothing else and is wasting lots of energy keeping unused cores awake.