
Access Path in Zero-Copy in OpenCL

Tags:

opencl

I am a little bit confused about how exactly zero-copy works.

1 - I want to confirm that the following corresponds to zero-copy in OpenCL.

 .......................
 .           .         .  
 .           .         .
 .           . CPU     . 
 .   SYSTEM  .         .
 .    RAM    . c3 X    .  
 .         <=====>     .  
 ...|...................
   PCI-E     / /
    |       / /
 c2 |X     /PCI-E, CPU directly accessing GPU memory
    |     / /                          copy c3, c2 is avoided, indicated by X. 
 ...|...././................
 .   MEMORY<====>          .
 .   OBJECT  .c1           . 
 .           .     GPU     .
 .   GPU RAM .             .  
 .           .             .  
 ...........................




 .......................
 .           .         .  
 .           .         .
 .           .   CPU   . 
 .SYSTEM RAM .         .
 .           .         .
 .           . c3      .  
 .    MEMORY<====>     .           
 ...| OBJECT............
    |     \  \   
   PCI-E   \  \PCI-E, GPU directly accessing System memory.  copy c2, c1 is avoided
    |       \  \
 C2 |X       \  \
 ...|.........\..\...........
 .  |        .              .
 .       <=======>          . 
 .   GPU    c1 X   GPU      .
 .   RAM     .              .  
 .           .              .  
 ............................

The GPU/CPU is accessing System/GPU-RAM directly, without explicit copy.

2 - What is the advantage of this? PCIe still limits the overall bandwidth. Or is the only advantage that we avoid copies c2 and c1/c3 in the situations above?

gpuguy asked Oct 07 '12


1 Answer

You are correct in your understanding of how zero-copy works. The basic premise is that you can access either host memory from the device, or device memory from the host, without an intermediate buffering step in between.

You can perform zero-copy by creating buffers with the following flags:

CL_MEM_USE_PERSISTENT_MEM_AMD // device-resident memory (AMD-specific extension)
CL_MEM_ALLOC_HOST_PTR         // host-resident memory

Then, the buffers can be accessed using memory mapping semantics:

void* p = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
//Perform writes to the buffer p
err = clEnqueueUnmapMemObject(queue, buffer, p, 0, NULL, NULL);
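Putting those pieces together, a host-resident zero-copy write might look like the sketch below. This is a fragment, not a full program: it assumes a context `ctx` and command queue `queue` have already been created, and error handling is trimmed for brevity.

```c
#include <string.h>   /* memset */

cl_int err;
size_t size = 1024 * 1024;

/* With CL_MEM_ALLOC_HOST_PTR the runtime allocates pinned host memory
   that the device can access directly over the bus. */
cl_mem buffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

/* Map for writing; on a true zero-copy path this returns a pointer to
   the pinned allocation itself rather than to a staging copy. */
void *p = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE,
                             0, size, 0, NULL, NULL, &err);

memset(p, 0, size);  /* fill the buffer from the host side */

err = clEnqueueUnmapMemObject(queue, buffer, p, 0, NULL, NULL);
```

After the unmap, kernels enqueued on `queue` see the written data without any explicit clEnqueueWriteBuffer call.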

Using zero-copy, you may be able to outperform an implementation that does the following:

  1. Copy a file to a host buffer
  2. Copy buffer to the device

Instead, you can do it with a single transfer:

  1. Memory Map device side buffer
  2. Copy file from host to device
  3. Unmap memory
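The three steps above can be sketched as follows. Again this is a fragment assuming `ctx`, `queue`, and `size` already exist; `input.bin` is a hypothetical input file, and the AMD persistent-memory flag is a vendor extension that may not be available on other platforms.

```c
#include <stdio.h>   /* fopen, fread */

cl_int err;
cl_mem devbuf = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                               size, NULL, &err);

/* 1. Memory-map the device-side buffer into the host address space. */
void *p = clEnqueueMapBuffer(queue, devbuf, CL_TRUE, CL_MAP_WRITE,
                             0, size, 0, NULL, NULL, &err);

/* 2. Read the file straight into device memory: no intermediate
      host-side buffer is involved. */
FILE *f = fopen("input.bin", "rb");
fread(p, 1, size, f);
fclose(f);

/* 3. Unmap; on a genuine zero-copy path no extra transfer happens here. */
err = clEnqueueUnmapMemObject(queue, devbuf, p, 0, NULL, NULL);
```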

On some implementations, however, the map and unmap calls merely hide the cost of the data transfer: the copy still happens behind the scenes. In our example:

  1. Memory Map device side buffer [Actually creates a host-side buffer of the same size]
  2. Copy file from host to device [Actually writes to the host-side buffer]
  3. Unmap memory [Actually copies data from host-buffer to device-buffer via clEnqueueWriteBuffer]

If the implementation behaves this way, there is no benefit to using the mapping approach. However, AMD's newer OpenCL drivers allow the data to be written directly, making the cost of mapping and unmapping almost zero. For discrete graphics cards the requests still travel over the PCIe bus, so data transfers can be slow.

In the case of an APU architecture, however, zero-copy semantics can greatly increase transfer speeds thanks to the APU's unified design (pictured below). In this architecture, the PCIe bus is replaced by the Unified North Bridge (UNB), which allows for faster transfers.

BE AWARE that when using zero-copy semantics with memory mapping, you will see absolutely horrendous bandwidths when reading a device-side buffer from the host. These bandwidths are on the order of 0.01 Gb/s and can easily become a new bottleneck for your code.
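If you do need the contents of a device-resident buffer back on the host, an explicit read into ordinary host memory is usually far faster than reading through the mapped pointer. A minimal sketch, assuming `queue`, `devbuf`, and `size` from the earlier fragments:

```c
#include <stdlib.h>  /* malloc, free */

/* Instead of reading `devbuf` element-by-element through a mapped
   (uncached) pointer, copy it back in one explicit transfer. */
void *host = malloc(size);
cl_int err = clEnqueueReadBuffer(queue, devbuf, CL_TRUE, 0, size,
                                 host, 0, NULL, NULL);
/* ... process `host` at full system-memory speed ... */
free(host);
```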

Sorry if this is too much information. This was my thesis topic.

[Image: APU Architecture]

KLee1 answered Nov 24 '22