Are OpenCL work items executed in parallel?

2 Answers

The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.

On my AMD card, the 'compute units' are divided into 16 SIMD units, each 4 wide. This means that 16 work items can technically run at the same time in a group. It is recommended to use multiples of 64 work items in a group to hide memory latency. Clearly, they cannot all run at exactly the same time. This is not a problem, because most kernels are in fact memory bound, so the (hardware) scheduler swaps out the work items that are waiting on the memory controller while the 'ready' items get their compute time. The actual number of work items in a group is set by the host program and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment to find the optimal work group size for your kernel.
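To make the arithmetic above concrete, here is a minimal sketch in Python. The function names are hypothetical, the 16 lanes and the multiple of 64 are simply the figures quoted for this one card, and `device_max` stands in for whatever value the host program reads from CL_DEVICE_MAX_WORK_GROUP_SIZE. It only gives a starting point; the best size still has to be measured.

```python
import math


def passes_per_group(group_size: int, lanes: int) -> int:
    """Passes needed to issue every work item once when only `lanes` run at a time."""
    return math.ceil(group_size / lanes)


def pick_group_size(device_max: int, multiple: int = 64) -> int:
    """Largest multiple of `multiple` that fits under the device limit.

    Falls back to the device limit itself when it is smaller than `multiple`.
    """
    if device_max < multiple:
        return device_max
    return (device_max // multiple) * multiple


if __name__ == "__main__":
    # Assumed figures from the answer: 16 lanes, groups of 64 work items.
    print(passes_per_group(64, 16))
    # Assumed device limit of 256, purely for illustration.
    print(pick_group_size(256))
```

With these assumed numbers, a group of 64 work items needs 4 passes over 16 lanes, which is what gives the scheduler other work to switch to while some items wait on memory.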

The CPU implementation is 'worse' when it comes to simultaneous work items. There are only ever as many work items running as you have cores available to run them on, so they behave more sequentially on the CPU.

So do work items run at exactly the same time? Almost never, really. This is why we need to use barriers when we want to be sure they all pause at a given point.
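The need for barriers can be illustrated with a small simulation. This is a sketch in plain Python, not OpenCL: it assumes the worst case described above, where work items in a group run one after another, and the function names and the `i * 10` payload are made up for the example. Each work item writes its own slot of a shared (local) array, then reads its neighbour's slot.

```python
def run_without_barrier(n):
    """Each work item runs to completion before the next one starts."""
    local = [None] * n
    out = []
    for i in range(n):
        local[i] = i * 10                 # write own slot
        out.append(local[(i + 1) % n])    # read neighbour, possibly not written yet
    return out


def run_with_barrier(n):
    """All work items finish the write phase before any starts the read phase."""
    local = [None] * n
    for i in range(n):
        local[i] = i * 10                 # phase 1: every item writes
    # A barrier would sit here in a real kernel.
    return [local[(i + 1) % n] for i in range(n)]  # phase 2: every item reads


if __name__ == "__main__":
    print(run_without_barrier(4))
    print(run_with_barrier(4))
```

In an actual OpenCL C kernel, the point marked in the second function is where a call such as `barrier(CLK_LOCAL_MEM_FENCE)` would go; without it, most work items in this model read a slot that has not been written yet.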

answered Sep 20 '22 05:09

mfa

In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.

Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is the max, but it can be smaller for large kernels using a lot of registers). All work groups are then scheduled on the (usually 2 to 16) cores of the GPU.
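As a rough illustration of how work groups map onto cores, here is a deliberately simplified Python sketch. It assumes that each core holds exactly one work group at a time and that groups are handed out in order, which real hardware does not guarantee; `schedule_groups` and its parameters are hypothetical names, and the sizes in the usage lines are made up.

```python
def schedule_groups(global_size, group_size, cores):
    """Return rounds of work group ids, one group per core per round.

    Simplified model: a real GPU may keep several groups resident on a core
    and may pick them in any order.
    """
    if global_size % group_size != 0:
        raise ValueError("this sketch assumes global_size is a multiple of group_size")
    num_groups = global_size // group_size
    return [
        list(range(start, min(start + cores, num_groups)))
        for start in range(0, num_groups, cores)
    ]


if __name__ == "__main__":
    # 1024 work items in groups of 256, on an assumed 2-core device.
    print(schedule_groups(1024, 256, 2))
    print(schedule_groups(768, 256, 2))
```

In this model, groups that land in different rounds are never resident at the same time.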

You can synchronize threads (work items) inside a work group, because they are all resident on the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time and could be executed on different cores.

Yes, it is possible to have 128 work items inside a work group, unless the kernel consumes too many resources. To reach maximum performance, you usually want the largest possible number of threads in a work group (at least 64 are required to hide memory latency; see Vasily Volkov's presentations on this subject).

answered Sep 19 '22 05:09

Eric Bainville

Tags:

opencl

K0n57an71n