 

CUDA - how much slower is transferring over PCI-E?

Tags:

cuda

nvidia

pci-e

If I transfer a single byte from a CUDA kernel over PCI-E to the host (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes?

What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is: does it make any difference whether I transfer just a single byte or a huge amount of data? Or perhaps, since memory transfers are performed in "bulk", is transferring a single byte extremely expensive and wasteful compared to transferring 200 MB?
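For concreteness, here is a minimal sketch of the zero-copy setup being asked about: a kernel writing a single byte directly into mapped, page-locked host memory. The kernel and variable names are illustrative, not from any particular codebase.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes a single byte directly to host (zero-copy) memory over PCI-E.
__global__ void write_flag(volatile unsigned char *host_flag)
{
    *host_flag = 1;
}

int main()
{
    // Allow the device to map page-locked host allocations.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned char *h_flag = nullptr;
    unsigned char *d_flag = nullptr;

    // Pinned, mapped ("zero-copy") host allocation of one byte.
    cudaHostAlloc((void **)&h_flag, 1, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);

    *h_flag = 0;
    write_flag<<<1, 1>>>(d_flag);
    cudaDeviceSynchronize();

    printf("flag = %u\n", *h_flag);  // expected: 1
    cudaFreeHost(h_flag);
    return 0;
}
```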

Asked Jul 18 '13 by Marco A.


People also ask

How fast can PCIe transfer data?

Summary of PCI Express interface parameters:

Data rate (per lane): PCIe 3.0 = 1000 MB/s, PCIe 2.0 = 500 MB/s, PCIe 1.1 = 250 MB/s
Total bandwidth (x16 link): PCIe 3.0 = 32 GB/s, PCIe 2.0 = 16 GB/s, PCIe 1.1 = 8 GB/s
Data transfer rate: PCIe 3.0 = 8.0 GT/s, PCIe 2.0 = 5.0 GT/s, PCIe 1.1 = 2.5 GT/s

How do you optimize data transfers in Cuda?

The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked (pinned) host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.
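A sketch of what this means in practice: the same host-to-device copy, timed once from a pageable (malloc) buffer and once from a pinned (cudaMallocHost) buffer. The helper name time_h2d is illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Times one host-to-device copy of `bytes` from `src` using CUDA events.
static float time_h2d(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 200u << 20;  // 200 MB, as in the question

    unsigned char *h_pageable = (unsigned char *)malloc(bytes);
    unsigned char *h_pinned = nullptr;
    unsigned char *d_buf = nullptr;

    cudaMallocHost((void **)&h_pinned, bytes);  // page-locked: DMA-ready
    cudaMalloc((void **)&d_buf, bytes);

    // Warm-up copy so context creation does not pollute the timings.
    cudaMemcpy(d_buf, h_pinned, 1, cudaMemcpyHostToDevice);

    printf("pageable H2D: %.3f ms\n", time_h2d(d_buf, h_pageable, bytes));
    printf("pinned   H2D: %.3f ms\n", time_h2d(d_buf, h_pinned, bytes));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```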

What is pinned memory Cuda?

Pinned memory consists of virtual-memory pages that are specially marked so that they cannot be paged out. It is allocated with a special system API function call, and is also known as page-locked memory. CPU memory that serves as the source or destination of a DMA transfer must be pinned.
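As a sketch, two common ways to obtain pinned memory with the CUDA runtime: allocating it page-locked from the start, or pinning an existing pageable allocation in place (variable names are illustrative):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Option 1: allocate memory that is page-locked from the start.
    void *h_a = nullptr;
    cudaMallocHost(&h_a, bytes);

    // Option 2: pin an existing pageable allocation in place.
    void *h_b = malloc(bytes);
    cudaHostRegister(h_b, bytes, cudaHostRegisterDefault);

    // Both h_a and h_b can now serve as DMA source/destination.

    cudaHostUnregister(h_b);
    free(h_b);
    cudaFreeHost(h_a);
    return 0;
}
```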

Which function is used to transfer data from source to destination in Cuda?

cudaMemcpy(): this API function is used for memory data transfer. It takes four parameters: a pointer to the destination, a pointer to the source, the amount of data to be copied (in bytes), and the direction of transfer.
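A minimal usage sketch of those four parameters (buffer names are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256;
    char h_buf[256] = {0};
    char *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);

    // Destination, source, byte count, direction of transfer.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    return 0;
}
```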


1 Answer

I hope this figure explains everything. The data was generated by the bandwidthTest program in the CUDA samples. The hardware environment is PCI-E v2.0, a Tesla M2090, and two Xeon E5-2609 CPUs. Please note that both axes are on a log scale.

Given this figure, we can see that the overhead of launching a transfer request takes a roughly constant time, independent of the transfer size. Regression analysis on the data gives an estimated overhead of 4.9 µs for host-to-device (H2D), 3.3 µs for device-to-host (D2H), and 3.0 µs for device-to-device (D2D) transfers. In other words, a single-byte transfer pays essentially only this fixed per-transfer overhead, while a 200 MB transfer is dominated by the bandwidth term.

[Figure: transfer time vs. transfer size for H2D, D2H, and D2D copies, both axes on a log scale]
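For reference, a minimal sketch of how such a latency-versus-size sweep can be reproduced with CUDA events (this is not the actual bandwidthTest source; it assumes only the standard CUDA runtime API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t max_bytes = 1 << 26;  // sweep up to 64 MB
    unsigned char *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, max_bytes);  // pinned host buffer
    cudaMalloc((void **)&d, max_bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy so context creation does not skew the first sample.
    cudaMemcpy(d, h, 1, cudaMemcpyHostToDevice);

    // Sweep transfer sizes in powers of two, as bandwidthTest does.
    for (size_t n = 1; n <= max_bytes; n <<= 1) {
        cudaEventRecord(start);
        cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // time(n) ~ overhead + n / bandwidth; small n is overhead-bound
        printf("%10zu bytes: %10.3f us\n", n, ms * 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```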

Answered Sep 30 '22 by kangshiyin