The data rate of cudaMemcpy operations is heavily influenced by the number of PCI-e 3.0 (or 2.0) lanes that are allocated to run from the CPU to the GPU. I'm curious about how PCI-e lanes are used on Nvidia devices containing two GPUs.
Nvidia has a few products that have two GPUs on a single PCI-e device, for example the GTX 590 (two GF110 GPUs) and the GTX 690 (two GK104 GPUs).
As with many newer graphics cards, these devices mount in PCI-e x16 slots. For cards that contain only one GPU, the GPU can use all 16 PCI-e lanes.
If I have a device containing two GPUs (like the GTX 690), but I'm running compute jobs on only one of the GPUs, can all 16 PCI-e lanes serve the one GPU that is being used?
To show this as ascii art...
[ GTX690 (2x GK104) ] ------ 16 PCI-e lanes ----- [ CPU ]
I'm not talking about the case where the CPU is connected to two cards that have one GPU each (as in the following diagram).
[ GTX670 (1x GK104) ] ------ PCI-e lanes ----- [ CPU ] ------ PCI-e lanes ----- [ GTX670 (1x GK104) ]
The GTX 690 uses a PLX PCIe Gen 3 bridge chip to connect the two GK104 GPUs with the host PCIe bus. There is a full x16 connection from the host to the PLX device, and from the PLX device to each GPU (the PLX device has a total of 48 lanes). Therefore, if you use only one GPU, you can achieve approximately full x16 bandwidth to that GPU.
You can explore this with the bandwidthTest program that is included in the CUDA samples. bandwidthTest targets a single GPU (either of the two on the card, selectable via a command line option), and you should see approximately full bandwidth, depending on the system. If your system is Gen3 capable, you should see full PCIe x16 Gen 3 bandwidth (don't forget to use the --memory=pinned option). The figure varies with the specific system but should be well north of 6 GB/s (probably in the 9-11 GB/s range). If your system is Gen2 capable, you should see something in the 4-6 GB/s range.
A similar statement can be made about the GTX 590; however, it is a Gen2-only device and uses a different bridge chip.
The results of bandwidthTest confirm that a full x16 logical path exists between the root port and either GPU. There is no free lunch, of course: you cannot get full bandwidth to both GPUs simultaneously, because both share the single x16 slot.
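If you would rather measure this from your own code than from the bundled sample, a minimal sketch along the following lines times pinned host-to-device copies to one selected GPU. It is not the bandwidthTest source: the 64 MiB buffer, the 20 repetitions, the helper name gigabytesPerSecond, and the convention 1 GB = 1e9 bytes are all assumptions made for illustration, and the numbers it prints depend on your system.

```cuda
// h2d_bandwidth.cu: rough pinned host-to-device bandwidth check for one GPU.
// Build (assuming nvcc is on the PATH): nvcc -o h2d_bandwidth h2d_bandwidth.cu
// Run: ./h2d_bandwidth [device_index]
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: bytes moved in `ms` milliseconds, expressed in GB/s
// (assumption: 1 GB = 1e9 bytes).
static double gigabytesPerSecond(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

int main(int argc, char** argv) {
    // Pick which GPU to test; on a dual-GPU card each GPU is its own device.
    int device = (argc > 1) ? std::atoi(argv[1]) : 0;
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (device < 0 || device >= count) {
        std::fprintf(stderr, "device %d not found (%d available)\n", device, count);
        return EXIT_FAILURE;
    }
    CHECK(cudaSetDevice(device));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, device));
    std::printf("Device %d: %s (PCI bus %d, device %d)\n",
                device, prop.name, prop.pciBusID, prop.pciDeviceID);

    const size_t bytes = 64u << 20;  // 64 MiB per copy (arbitrary choice)
    const int reps = 20;             // number of timed copies (arbitrary choice)

    void* hostBuf = nullptr;
    void* devBuf = nullptr;
    CHECK(cudaMallocHost(&hostBuf, bytes));  // pinned (page-locked) host memory
    CHECK(cudaMalloc(&devBuf, bytes));
    std::memset(hostBuf, 0, bytes);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up copy so one-time setup cost is excluded.
    CHECK(cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice));

    // Time `reps` back-to-back copies on the default stream.
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("Host to device, pinned, %d x %zu MiB: %.2f GB/s\n",
                reps, bytes >> 20, gigabytesPerSecond(bytes * reps, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

Running it once with device index 0 and once with index 1 should report similar figures for the two GPUs of a dual-GPU card. Launching both runs at the same time is one way to observe the shared x16 uplink, since the two transfers then compete for the same slot.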