OpenCL: work group concept

Tags:

opencl

I don't really understand the purpose of Work-Groups in OpenCL.

I understand that a work-group is a group of work-items (supposedly hardware threads) that are executed in parallel.

However, why is this coarser subdivision needed? Wouldn't it be OK to have only the grid of threads (and, de facto, only one work-group)?

Should a work-group map exactly to a physical core? For example, the Tesla C1060 card is said to have 240 cores. How would the work-groups map to this?

Also, as far as I understand, work-items inside a work-group can be synchronized thanks to memory fences. Can work-groups synchronize with each other, or is that even needed? Do they talk to each other via shared memory, or is that only for work-items (not sure on this one)?


asked Nov 07 '14 15:11

Carmellose

2 Answers

Part of the confusion here, I think, comes down to terminology. What GPU people often call cores aren't really cores, and what GPU people often call threads are threads only in a certain sense.

Cores: A core, in GPU marketing terms, may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit. By that counting, a single-core x86 CPU with 4-wide SSE would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison; you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
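As a rough illustration of that division, here is a minimal sketch applied to the card from the question. The function name is made up for this example, and the lane count of 8 per multiprocessor is my assumption about the Tesla C1060's generation (the "similar number" for that card), not something stated above.

```python
def comparable_core_count(marketing_cores, lanes_per_unit):
    """Convert a marketing 'core' count into a count of CPU-like cores.

    lanes_per_unit is the number of SIMD lanes that the vendor counts
    as separate 'cores' inside one CPU-like unit.
    """
    if lanes_per_unit <= 0:
        raise ValueError("lanes_per_unit must be positive")
    if marketing_cores % lanes_per_unit != 0:
        raise ValueError("core count is not a multiple of the lane count")
    return marketing_cores // lanes_per_unit


# Tesla C1060: 240 marketing cores.
# ASSUMPTION: 8 scalar lanes per multiprocessor on that generation.
print(comparable_core_count(240, 8))  # 30
```

Under that assumption the card has 30 units that a CPU person would recognise as cores, which is the number a work-group count should be compared against.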

Work-items: Each work-item in OpenCL is a thread in terms of its control flow and its memory model. The hardware may run multiple work-items on a single hardware thread; you can picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. On a CPU it would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
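The SSE picture can be mimicked in plain Python. This is only a sketch of the idea, not how any real compiler or GPU does it: `LANES`, `scalar_kernel`, `run_on_hardware_thread` and `launch` are hypothetical names, and the kernel body (`a * 2 + b`) is an arbitrary example.

```python
LANES = 4  # ASSUMPTION: one vector register holds four values, as with SSE floats


def scalar_kernel(gid, a, b, out):
    """What the programmer writes: one work-item handles one element."""
    out[gid] = a[gid] * 2.0 + b[gid]


def run_on_hardware_thread(first_gid, a, b, out):
    """What one hardware thread might execute: the same kernel body
    applied to LANES consecutive work-items in lockstep."""
    lane_a = a[first_gid:first_gid + LANES]
    lane_b = b[first_gid:first_gid + LANES]
    out[first_gid:first_gid + LANES] = [x * 2.0 + y for x, y in zip(lane_a, lane_b)]


def launch(a, b):
    """Run len(a) work-items using len(a) / LANES hardware threads."""
    n = len(a)
    if n % LANES != 0:
        raise ValueError("this sketch needs a multiple of LANES work-items")
    out = [0.0] * n
    for first_gid in range(0, n, LANES):
        run_on_hardware_thread(first_gid, a, b, out)
    return out
```

Eight work-items launched this way occupy only two hardware threads, yet each work-item still sees the result it would have computed on its own.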

Work-groups: Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means the state of every work-item in the group must be resident at the same time, so that when a synchronization primitive is encountered there is little overhead in switching between them, and the switch is guaranteed to be possible.

A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core. CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit, and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. The work-items are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
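To make the mapping concrete, here is a small sketch of how a one-dimensional index space splits into work-groups and how whole groups might be handed to compute units. The relation between the ids mirrors what `get_global_id`, `get_group_id`, `get_local_id` and `get_local_size` report in a kernel (with no global offset). The round-robin assignment is purely an assumption for illustration; real schedulers are implementation-defined.

```python
def decompose(global_id, local_size):
    """Split a 1-D global id into (group_id, local_id).

    Inverse of: global_id = group_id * local_size + local_id
    """
    return divmod(global_id, local_size)


def assign_groups(num_groups, num_compute_units):
    """HYPOTHETICAL round-robin schedule: each work-group lands, whole,
    on exactly one compute unit. Returns {compute_unit: [group ids]}."""
    schedule = {cu: [] for cu in range(num_compute_units)}
    for group_id in range(num_groups):
        schedule[group_id % num_compute_units].append(group_id)
    return schedule


# 1024 work-items in groups of 64 gives 16 work-groups.
print(decompose(130, 64))      # (2, 2): third group, third work-item in it
print(assign_groups(16, 4)[0])  # [0, 4, 8, 12]
```

A compute unit may hold several groups over time (or at once, if resources allow), but a single group is never split across two compute units.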

Within a work-group, work-items can synchronize with each other through barriers, communicating via local memory or global memory. Work-groups cannot synchronize with each other, and the standard makes no guarantees about the forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives across work-groups effectively impossible.
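A common consequence of this is the two-pass reduction: each work-group reduces its own slice using local memory and barriers, and the per-group results are combined in a separate step. The following is a sequential Python simulation of that pattern, not OpenCL code; the function names are invented for the example and the group size is assumed to be a power of two.

```python
def work_group_partial_sums(values, local_size):
    """Simulate a per-work-group tree reduction.

    Each slice of local_size values plays the role of one work-group's
    local memory. Returns one partial sum per work-group.
    """
    if local_size <= 0 or (local_size & (local_size - 1)) != 0:
        raise ValueError("this sketch assumes a power-of-two local_size")
    partials = []
    for start in range(0, len(values), local_size):
        local = list(values[start:start + local_size])
        local += [0] * (local_size - len(local))  # pad the last group
        stride = local_size // 2
        while stride > 0:
            # Work-items with local_id < stride each do one addition.
            for local_id in range(stride):
                local[local_id] += local[local_id + stride]
            # In a kernel, barrier(CLK_LOCAL_MEM_FENCE) would sit here so
            # every work-item sees the updated local memory.
            stride //= 2
        partials.append(local[0])
    return partials


def total_sum(values, local_size):
    """Second pass: groups cannot wait on each other, so the partial
    sums are combined afterwards (on the host or in another kernel)."""
    return sum(work_group_partial_sums(values, local_size))


print(work_group_partial_sums(list(range(1, 17)), 4))  # [10, 26, 42, 58]
print(total_sum(list(range(1, 17)), 4))                # 136
```

The barrier only ever coordinates work-items inside one group; nothing in the first pass depends on another group having started or finished.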

A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.


answered Sep 20 '22 06:09

Lee

There are already a lot of good answers. For further understanding of OpenCL terminology, the paper "An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter describes all the concepts very well.

answered Sep 22 '22 06:09

chutsu
