Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are OpenCL kernels executed asynchronously?

For CUDA, I know they are executed asynchronously after issuing the launch commands to the default stream(null stream), so how about that in OpenCL? Sample codes are as follows:

cl_context context;
cl_device_id device_id;
cl_int err;
...
cl_kernel kernel1;
cl_kernel kernel2;
cl_command_queue Q = clCreateCommandQueue(context, device_id, 0, &err);
...
size_t global_w_offset[3] = {0,0,0};
size_t global_w_size[3] = {16,16,1};
size_t local_w_size[3] = {16,16,1};
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size, 
                             local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size, 
                             local_w_size, 0, nullptr, nullptr);
clFinish(Q);

Will kernel1 and kernel2 be executed asynchronously after commands enqueued?(i.e. executions overlap)

Update
According to the OpenCL Reference, It seems set properties as CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue can meet my need. But, Does out_of_order mean asynchronization?

like image 969
Finley Avatar asked Mar 26 '26 15:03

Finley


2 Answers

Does out_of_order mean asynchronization

"Out of order" queue means kernels may execute in different order than they were queued (if their event/data dependencies allow it). They also may execute concurrently, but not necessarily.

Also, asynchronous execution means something else than execution overlap (that's called parallel execution or concurrency). Asynchronous execution means that kernel code on device executes independently of host code - which is always true in OpenCL.

The simple way to get concurrency (execution overlap) is by using >1 queues on the same device. This works even on implementations which don't have Out-of-order queue capability. It does not guarantee execution overlap (because OpenCL can be used on much more devices than CUDA, and on some devices you simply can't execute >1 kernel at a time), but in my experience with most GPUs you should get at least some overlap. You need to be careful about buffers used by kernels in separate queues, though.

like image 173
mogu Avatar answered Mar 29 '26 05:03

mogu


In your current code:

err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size, 
                             local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size, 
                             local_w_size, 0, nullptr, nullptr);

kernel1 finishes first and then kernel2 is executed

Using


clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

you can execute multiple different kernels concurrently though it isn't guranteed.

Beware though, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported in all OpenCL implementations. This also means that you have no guarantee that kernel1 will finish execution before kernel2. If any objects that are output by kernel1 are required as input in kernel2, it may fail.

Also multiple command queues can be created and enqueued with commands and the reason for their existence is because the problem you wish to solve might involve some, if not all of the heterogeneous devices in the host. And they could represent independent streams of computation where no data is shared, or dependent streams of computation where each subsequent task depends on the previous task (often, data is shared). However, these command queues will execute on the device without synchronization, provided that no data is shared. If data is shared, then the programmer needs to ensure synchronization of the data using synchronization commands in the OpenCL specification.

like image 43
Waqar Avatar answered Mar 29 '26 03:03

Waqar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!