How to use async_work_group_copy in OpenCL?

Question

I would like to understand how to correctly use the async_work_group_copy() call in OpenCL. Let's have a look on a simplified example:

__kernel void test(__global float *x) {
  __local xcopy[GROUP_SIZE];

  int globalid = get_global_id(0);
  int localid = get_local_id(0);
  event_t e = async_work_group_copy(xcopy, x+globalid-localid, GROUP_SIZE, 0);
  wait_group_events(1, &e);
}

The reference http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/async_work_group_copy.html says "Perform an async copy of num_elements gentype elements from src to dst. The async copy is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a workgroup executing the kernel with the same argument values; otherwise the results are undefined."

But that doesn't clarify my questions...

I would like to know, if the following assumptions are correct:

The call to async_work_group_copy() must be executed by all work-items in the group.
The call should be in a way, that the source address is identical for all work-items and points to the first element of the memory area to be copied.
As my source address is relative based on the global work-item id of the first work-item in the work-group. So I have to subtract the local id to have the address identical for all work-items...
Is the third parameter really the number of elements (not the size in bytes)?

Bonus questions:

a. Can I just use barrier(CLK_LOCAL_MEM_FENCE) instead of wait_group_events() and ignore the return value? If so, would that be probably faster?

b. Does a local copy also make sense for processing on CPUs or is that overhead as they share a cache anyway?

Regards, Stefan

mfa · Accepted Answer

One of the main reasons for this function existing is to allow the driver/kernel compiler to efficiently copy the memory without the developer having to make assumptions about the hardware.

You describe what memory you need copied as if it were a single-threaded copy, and async_work_group_copy gets it done for you using the parallel hardware.

For your specific questions:

I have never seen async_work_group_copy used by only some of the work items in a group. I always assumed this is because it it required. I think the blocking nature of wait_group_events forces all work items to be part of the copy.
Yes. Source (and destination) addresses need to be the same for all work items.
You could subtract your local id to get the correct address, but I find that basing the address on groupId solves this problem as well. (get_group_id)
Yes. The last param is the number of elements, not the size in bytes.

a. No. The event-based you will find that your barrier is hit almost immediately by the work items, and the data won't necessarily be copied. This makes sense because some opencl hardware might not even use the compute units at all to do the actual copy operation.

b. I think that cpu opencl implementations might guarantee L1 cache usage when you use local memory. The only way to know for sure if this performs better is to benchmark your application with various settings.

How to use async_work_group_copy in OpenCL?

Tags:

opencl

SDwarfs

1 Answers

mfa

Recent Activity

Donate For Us

How to use async_work_group_copy in OpenCL?

Tags:

opencl

SDwarfs

1 Answers

mfa

Related questions

Recent Activity

Donate For Us