
Device to device copy in Vulkan

Tags: c++, gpu, vulkan

I want to copy an image/buffer between two GPUs/physical devices in my Vulkan application (one vkInstance, two vkDevices). Is this possible without staging the image on the CPU or is there a feature like CUDA p2p? How would this look?

If staging on the host is required, what would be the optimal method for this?

slocook asked Oct 03 '18 16:10


1 Answer

is there a feature like CUDA p2p?

Vulkan 1.1 supports the concept of device groups to cover this situation. It allows you to treat a set of physical devices as a single logical device, and also lets you query how memory can be manipulated within the device group, as well as do things like allocate memory on a subset of devices. Check the specifications for the full set of functionality.

Is this possible without staging the image on the CPU

If your devices don't support the extension VK_KHR_device_group (promoted to core in Vulkan 1.1), then no. You must transfer the content through the CPU and system memory.

Since buffers are per-device, you would need two host-visible staging buffers: one on the source device for the read operation, and another on the target device for the write operation. You'll also need two queues, two command buffers, and so on, one of each per device.

You'll have to execute three operations with manual synchronization:

  • On the source GPU, copy from the device-local buffer to the host-visible buffer on the same device.

  • On the CPU, copy from the source GPU's host-visible buffer to the target GPU's host-visible buffer.

  • On the target GPU, copy from the host-visible buffer to the device-local buffer.

Make sure to inspect your device queue family properties and if possible use a queue from a queue family that is marked as transfer capable but not graphics or compute capable. The fewer flags a Vulkan queue family has, the better suited it is to the operations that it does have flags for. Most modern discrete GPUs have dedicated transfer queues, but again, queues are specific to devices, so you'll need to be interacting with one queue for each device to execute the transfer.
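As a rough illustration of the queue family selection described above, here is a minimal sketch. The helper name pickTransferFamily and the k...Bit constants are assumptions made for this example: the constants mirror the VkQueueFlagBits values so the snippet compiles without the Vulkan SDK, and the input vector stands in for the queueFlags fields you would read via vkGetPhysicalDeviceQueueFamilyProperties.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Local mirrors of VkQueueFlagBits values (assumption: used here only so the
// sketch builds without the Vulkan headers; real code would use the Vk names).
constexpr std::uint32_t kGraphicsBit = 0x1;  // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2;  // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t kTransferBit = 0x4;  // VK_QUEUE_TRANSFER_BIT

// Hypothetical helper: familyFlags[i] stands in for
// VkQueueFamilyProperties::queueFlags of queue family i on one device.
// Returns the index of the family to use for copies, if any.
std::optional<std::uint32_t> pickTransferFamily(
    const std::vector<std::uint32_t>& familyFlags) {
    std::optional<std::uint32_t> fallback;
    for (std::uint32_t i = 0; i < familyFlags.size(); ++i) {
        const std::uint32_t flags = familyFlags[i];
        // Preferred: transfer capable, but neither graphics nor compute.
        if ((flags & kTransferBit) && !(flags & (kGraphicsBit | kComputeBit))) {
            return i;
        }
        // Fallback: graphics and compute families can also execute transfer
        // commands, even if they do not report the transfer bit.
        if (!fallback && (flags & (kTransferBit | kGraphicsBit | kComputeBit))) {
            fallback = i;
        }
    }
    return fallback;
}
```

You would run this once per physical device, since each device reports its own queue families.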

If staging on the host is required, what would be the optimal method for this?

Exactly how to execute this depends on your use case. If you want to execute the whole thing synchronously in a single thread, then you'll just be doing a bunch of submits and then waiting on fences. If you want to do it asynchronously in the background while you continue to render frames, then you'll still be doing the submits, but you'll have to check the fences without blocking to see when each operation completes before you move on to the next part.
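For the asynchronous variant, one way to structure the per-frame check is a small state machine. This is only a sketch: StagedCopy, CopyStage and the three callbacks are hypothetical names, and the callbacks are where the real Vulkan calls would go (for example vkGetFenceStatus for the non-blocking fence checks). It assumes the copy on the source GPU has already been submitted with a fence before polling starts.

```cpp
#include <functional>

// Progress of the staged copy between the two devices.
enum class CopyStage { WaitingForDownload, WaitingForUpload, Done };

// Hypothetical wrapper around one staged device-to-device copy.
struct StagedCopy {
    // True once the source GPU's copy into its staging buffer has finished,
    // e.g. vkGetFenceStatus(...) == VK_SUCCESS on the source device.
    std::function<bool()> downloadDone;
    // Copies between the two mapped staging buffers on the CPU, then submits
    // the upload on the target device's queue with its own fence.
    std::function<void()> copyOnHostAndSubmitUpload;
    // True once the target GPU's copy into device-local memory has finished.
    std::function<bool()> uploadDone;

    CopyStage stage = CopyStage::WaitingForDownload;

    // Call once per frame. Never blocks; returns true when the copy is done.
    bool poll() {
        if (stage == CopyStage::WaitingForDownload && downloadDone()) {
            copyOnHostAndSubmitUpload();
            stage = CopyStage::WaitingForUpload;
        }
        if (stage == CopyStage::WaitingForUpload && uploadDone()) {
            stage = CopyStage::Done;
        }
        return stage == CopyStage::Done;
    }
};
```

The synchronous variant needs none of this: it would simply wait on each fence in turn between the three operations.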

If you're transferring buffers, there's probably nothing to worry about in terms of optimal transfer, but if you're dealing with images then you have to get into the whole linear vs. optimal image tiling mess. To avoid that, I'd suggest using host-visible buffers for staging, regardless of whether you're transferring images or buffers, and using vkCmdCopyImageToBuffer and vkCmdCopyBufferToImage to do the transfers between device-local and host-visible memory.

Jherico answered Sep 16 '22 12:09