
Concurrent programming in OpenCL vs Grand Central Dispatch

With the introduction of OpenCL 2.0, OpenCL seems to have many of the features of Grand Central Dispatch (GCD), such as Clang/Apple-style blocks and queues. Looking at their respective feature sets, I am wondering whether OpenCL can do everything that GCD/libdispatch can do, with the added ability to direct the computation to the GPU as well as the CPU, or whether GCD has more to offer that sets it apart from OpenCL.

Specifically my questions are:

  1. What differentiates the concurrency features of GCD and OpenCL?

  2. If there is value in using them together (assuming GCD offers added features), can C blocks be routed to either GCD queues or OpenCL queues? And if targeting the CPU, would there be a reason to pass through OpenCL rather than running directly?

  3. Does OpenCL 2.0 offer GCD style load balancing that can fill threads spanning both the CPU and GPU?

Troy Harvey asked Mar 20 '23 17:03


2 Answers

In its current form, OpenCL is able to address both data-parallel and task-parallel problems, as evidenced by the different OpenCL API primitives for enqueueing work:

  • clEnqueueNDRangeKernel: used to enqueue a kernel over an N-dimensional index space (the NDRange), divided into work-groups. Typically used for data-parallel processing.
  • clEnqueueTask: used to enqueue a kernel that consists of a single work item. This primitive, used for task-parallel execution, is essentially equivalent to clEnqueueNDRangeKernel with a global work size of 1. It was deprecated in the OpenCL 2.0 spec.
  • clEnqueueNativeKernel: enqueues a native C/C++ function to a device (if it supports native kernels). The benefit is that you can use the queueing mechanism (and the ordering in the queue) and also access buffer data from the OpenCL context directly. Other than that, this is very similar to the concept of a thread or task.
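The relationship between the first two primitives can be pictured with a small plain-C model. This is not the OpenCL API: `kernel_fn`, `run_ndrange_1d`, `run_task` and `square_kernel` are hypothetical names made up for illustration, and the loop runs sequentially, whereas a real device would execute the work-items concurrently. The sketch only shows that a single-work-item task is the degenerate case of an NDRange launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel: invoked once per work-item. */
typedef void (*kernel_fn)(size_t global_id, void *args);

/* Model of a 1-D NDRange launch: one kernel invocation per global ID.
 * A real device runs these work-items in parallel; this model runs them
 * one after another and only illustrates the indexing. */
void run_ndrange_1d(kernel_fn kernel, size_t global_size, void *args)
{
    assert(kernel != NULL);
    for (size_t gid = 0; gid < global_size; ++gid)
        kernel(gid, args);
}

/* Model of a single-work-item task: an NDRange with a global size of 1. */
void run_task(kernel_fn kernel, void *args)
{
    run_ndrange_1d(kernel, 1, args);
}

/* Example kernel: squares the element selected by the global ID. */
void square_kernel(size_t global_id, void *args)
{
    float *data = args;
    data[global_id] *= data[global_id];
}
```

Calling `run_ndrange_1d(square_kernel, 4, v)` on a four-element array squares every element, while `run_task(square_kernel, v)` touches only element 0.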

So while OpenCL clearly has its origins in data-parallel processing on GPUs and is still best suited to data that can somehow be mapped onto a one-, two- or three-dimensional grid, task-oriented processing is also possible. The benefits become more apparent once one starts targeting heterogeneous systems with multiple CPUs, GPUs, DSPs and accelerators, as OpenCL can target all those devices with one formalism.

GCD, on the other hand, provides a convenient library that relieves the developer of much of the burden of managing and scheduling tasks by building on the concept of queues (of different types and priorities). Using GCD can therefore result in less error-prone and more compact code on symmetric multiprocessing systems.

So while OpenCL and GCD have different origins (apart from the fact that they both came from Apple), they both use queues at their foundation to manage the distribution of work. They both also have a concept of "context" to define data access.

Since OS X version 10.7 it is possible to use GCD to dispatch OpenCL kernels (similar to blocks) to devices that support OpenCL, opening the potential for combining the benefits/strengths of both OpenCL and GCD.

Here is an attempt at some answers and insights regarding your specific questions:

1 - What differentiates the concurrency features of GCD and OpenCL?

As pointed out by @Dithermaster and @sharpneli, GCD originally targeted task-oriented (symmetric multi-) processing while OpenCL initially was meant for data-parallel processing on heterogeneous architectures.

One main difference between OpenCL and GCD is in the queuing mechanism. For example, while both support synchronous and asynchronous execution, GCD offers three priority levels (high, default and low) for its global concurrent queues. The OpenCL runtime has no equivalent: a command queue can be created in out-of-order mode, but it is then not defined which enqueued command the runtime will execute first.

The GCD manuals also indicate that a task is more lightweight than a traditional thread, making it possible to spawn a GCD task in far fewer instructions than a thread.

Another difference is in the memory model used. OpenCL uses a relaxed consistency model for kernels, with separate global, local, private and constant memory. GCD has no such explicit memory hierarchy: blocks simply run in the ordinary address space of the process.

On the other hand, OpenCL has vector data types and vector intrinsics that allow you to tap directly into the SIMD potential of an architecture without relying on the compiler. On some architectures this is beneficial, while for other architectures (like MIC) the recommendation is not to vectorize manually.

Finally (not really a concurrency feature, though), OpenCL has functions for reading and writing image types, essentially giving you direct access to texture memory. This can often be used to obtain significant speedups, even for algorithms unrelated to image processing.

2 - If there is value in using them together (assuming GCD offers added features), can C blocks be routed to either GCD queues or OpenCL queues? And if targeting the CPU, would there be a reason to pass through OpenCL rather than running directly?

By using GCD and OpenCL together you are able to address any device that supports OpenCL. So you are able to use the potentially heterogeneous nature of your platform while still being able to benefit from the higher level mechanisms that GCD provides to make multi-threading easier. Writing everything using the OpenCL C API (or even the C++ API) would likely result in slightly more code.

Additionally, Apple's GCD integration for OpenCL provides primitives like gcl_get_kernel_block_workgroup_info, which can recommend an optimal work-group size for your kernel.

However, in my understanding it is not possible to route arbitrary C blocks to either GCD or OpenCL queues. C blocks can only go to non-OpenCL queues. OpenCL kernels can only be dispatched (from the host side) to queues for devices that support OpenCL. From the device side (so from within an OpenCL kernel) a block can only be dispatched to the same device.

3 - Does OpenCL 2.0 offer GCD style load balancing that can fill threads spanning both the CPU and GPU?

No, OpenCL 2.0 does not really define how to do load balancing, neither from the point of view of the host, nor from the device.

However, on the host side one could easily split up a computation and run part of it on, say, the CPU and another part on one or more GPUs. One way to do this is to adjust the global work size and the global work offset for each device, copying only the minimum data each device needs. Autotuning mechanisms can then be used to figure out the best load balance between the devices used.
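The host-side split described here can be sketched in plain C, independent of any OpenCL call. Everything below is an illustrative assumption rather than part of OpenCL or GCD: `range_slice`, `split_range` and `rebalance_share` are hypothetical names, the split covers only two devices, and the "autotuning" step is a simple throughput-proportional update based on one measured run.

```c
#include <assert.h>
#include <stddef.h>

/* One device's slice of a 1-D global index space (hypothetical type). */
typedef struct {
    size_t offset; /* would be passed as that device's global work offset */
    size_t size;   /* would be passed as that device's global work size   */
} range_slice;

/* Split `total` work-items between two devices. `share_a` in [0, 1] is the
 * fraction given to device A. A's slice is rounded down to a multiple of
 * `granule` (for example the work-group size); B takes the remainder. */
void split_range(size_t total, double share_a, size_t granule,
                 range_slice *a, range_slice *b)
{
    assert(granule > 0 && share_a >= 0.0 && share_a <= 1.0);
    size_t size_a = (size_t)((double)total * share_a);
    size_a -= size_a % granule;
    a->offset = 0;
    a->size = size_a;
    b->offset = size_a;
    b->size = total - size_a;
}

/* One autotuning step: given the share used in the last run and the time
 * each device needed for its slice, return the share for A that would
 * balance the two devices if their throughput stayed constant. */
double rebalance_share(double share_a, double time_a, double time_b)
{
    assert(time_a > 0.0 && time_b > 0.0);
    double throughput_a = share_a / time_a;
    double throughput_b = (1.0 - share_a) / time_b;
    return throughput_a / (throughput_a + throughput_b);
}
```

For example, if both devices received half of the work and device A took three times as long as device B, `rebalance_share(0.5, 3.0, 1.0)` suggests giving A roughly a quarter of the work next time. A share of exactly 0 or 1 never recovers under this update, so a real tuner would need some exploration on top of it.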

Once OpenCL 2.0 drivers become available, the newly introduced pipes, dynamic parallelism and shared virtual memory will give more possibilities for efficiently partitioning work between devices. It is not clear if and how these features will become available through GCD.

Erik Duymelinck answered Apr 06 '23 22:04


1) Already answered well by Dithermaster. OpenCL is for when your problem parallelizes very well and you have a lot of data. GCD is for when you want to effortlessly spawn a thread to handle file I/O or the like. You can never call any system or other library functions from an OpenCL kernel.

2) If your problem is easily parallelizable, it's worth using OpenCL even on the CPU. As an example, Intel's OpenCL implementation manages to horizontally parallelize some kernels so that a single core effectively runs 8 work-items at once (one work-item "runs" in each vector lane of an SSE/AVX register).
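The "horizontal" parallelization mentioned here can be pictured as packing consecutive work-items into the lanes of one vector register. The plain-C sketch below only models that idea and is not Intel's implementation: `saxpy_item` and `saxpy_packed` are hypothetical names, and the lane count of 8 is simply the figure quoted in this answer. A vectorizing compiler could turn the inner lane loop into SIMD instructions, but the code itself is ordinary scalar C.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8 /* assumed vector width, taken from the figure quoted above */

/* Kernel body as the programmer writes it: the work of one work-item. */
float saxpy_item(float a, float x, float y)
{
    return a * x + y;
}

/* Conceptual view of implicit vectorization: LANES consecutive work-items
 * are processed together, one per vector lane. The inner loop stands in
 * for a single SIMD operation across all lanes. */
void saxpy_packed(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane)
            y[i + lane] = saxpy_item(a, x[i + lane], y[i + lane]);
    }
    /* Work-items that do not fill a whole vector run one at a time. */
    for (; i < n; ++i)
        y[i] = saxpy_item(a, x[i], y[i]);
}
```

With `n = 10`, the first eight work-items go through the packed loop and the last two through the scalar remainder loop.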

You cannot pass generic C blocks to OpenCL, just the kernel itself and nothing more. OpenCL is far more picky about what you can execute on it. The only thing the newfangled Clang IR brings is the ability to avoid distributing the kernel source in text form. It's still limited to just OpenCL kernel code.

3) Not automatically, although some implementations exist that do this. At last year's GDC, Intel demonstrated automatic load balancing on their CPU (it used the integrated GPU and the CPU at the same time).

OpenCL 2.0 does not really care about what sort of hardware it runs on, so it's going to be the responsibility of hardware manufacturers to implement such functionality on their platforms. For disjoint platforms, it's the programmer's headache.

sharpneli answered Apr 06 '23 21:04