 

CUDA - how much slower is transferring over PCI-E?

Tags:

cuda

nvidia

pci-e

If I transfer a single byte from a CUDA kernel over PCI-E to the host (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes?

What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is: does it make any difference whether I transfer just a single byte or a huge amount of data? Or perhaps, since memory transfers are performed in "bulk", is transferring a single byte extremely expensive and wasteful compared to transferring 200 MB?
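For concreteness, here is a minimal sketch of the zero-copy setup being asked about: a kernel writing a single byte directly into mapped, page-locked host memory. The kernel and variable names are illustrative, not from any particular codebase.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes a single byte directly to host (zero-copy) memory over PCI-E.
__global__ void write_flag(volatile unsigned char *host_flag)
{
    *host_flag = 1;
}

int main()
{
    // Allow the device to map page-locked host allocations.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned char *h_flag = nullptr;
    unsigned char *d_flag = nullptr;

    // Pinned, mapped ("zero-copy") host allocation of one byte.
    cudaHostAlloc((void **)&h_flag, 1, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);

    *h_flag = 0;
    write_flag<<<1, 1>>>(d_flag);
    cudaDeviceSynchronize();

    printf("flag = %u\n", *h_flag);  // expected: 1
    cudaFreeHost(h_flag);
    return 0;
}
```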

Asked Jul 18 '13 by Marco A.


People also ask

How fast can PCIe transfer data?

Summary of PCI Express interface parameters:

Data rate (per lane): PCIe 3.0 = 1000 MB/s, PCIe 2.0 = 500 MB/s, PCIe 1.1 = 250 MB/s
Total bandwidth (x16 link): PCIe 3.0 = 32 GB/s, PCIe 2.0 = 16 GB/s, PCIe 1.1 = 8 GB/s
Data transfer rate: PCIe 3.0 = 8.0 GT/s, PCIe 2.0 = 5.0 GT/s, PCIe 1.1 = 2.5 GT/s

How do you optimize data transfers in Cuda?

The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked (pinned) host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.
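A sketch of what this means in practice: the same host-to-device copy, timed once from a pageable (malloc) buffer and once from a pinned (cudaMallocHost) buffer. The helper name time_h2d is illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Times one host-to-device copy of `bytes` from `src` using CUDA events.
static float time_h2d(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 200u << 20;  // 200 MB, as in the question

    unsigned char *h_pageable = (unsigned char *)malloc(bytes);
    unsigned char *h_pinned = nullptr;
    unsigned char *d_buf = nullptr;

    cudaMallocHost((void **)&h_pinned, bytes);  // page-locked: DMA-ready
    cudaMalloc((void **)&d_buf, bytes);

    // Warm-up copy so context creation does not pollute the timings.
    cudaMemcpy(d_buf, h_pinned, 1, cudaMemcpyHostToDevice);

    printf("pageable H2D: %.3f ms\n", time_h2d(d_buf, h_pageable, bytes));
    printf("pinned   H2D: %.3f ms\n", time_h2d(d_buf, h_pinned, bytes));

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```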

What is pinned memory Cuda?

Pinned memory consists of virtual-memory pages that are specially marked so that they cannot be paged out. It is allocated with a special system API function call, and is also known as page-locked memory. CPU memory that serves as the source or destination of a DMA transfer must be pinned.
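As a sketch, two common ways to obtain pinned memory with the CUDA runtime: allocating it page-locked from the start, or pinning an existing pageable allocation in place (variable names are illustrative):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;

    // Option 1: allocate memory that is page-locked from the start.
    void *h_a = nullptr;
    cudaMallocHost(&h_a, bytes);

    // Option 2: pin an existing pageable allocation in place.
    void *h_b = malloc(bytes);
    cudaHostRegister(h_b, bytes, cudaHostRegisterDefault);

    // Both h_a and h_b can now serve as DMA source/destination.

    cudaHostUnregister(h_b);
    free(h_b);
    cudaFreeHost(h_a);
    return 0;
}
```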

Which function is used to transfer data from source to destination in Cuda?

cudaMemcpy(): this API function is used for memory data transfer. It takes four parameters: a pointer to the destination, a pointer to the source, the amount of data to be copied (in bytes), and the direction of transfer.
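A minimal usage sketch of those four parameters (buffer names are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256;
    char h_buf[256] = {0};
    char *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, bytes);

    // Destination, source, byte count, direction of transfer.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    return 0;
}
```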


1 Answer

I hope this figure explains everything. The data was generated by the bandwidthTest program in the CUDA samples. The hardware environment is PCI-E v2.0, a Tesla M2090, and two Xeon E5-2609 CPUs. Please note that both axes are on a log scale.

Given this figure, we can see that the overhead of launching a transfer request takes a roughly constant time, independent of the transfer size. Regression analysis on the data gives an estimated overhead of 4.9 µs for host-to-device (H2D), 3.3 µs for device-to-host (D2H), and 3.0 µs for device-to-device (D2D) transfers. In other words, a single-byte transfer pays essentially only this fixed per-transfer overhead, while a 200 MB transfer is dominated by the bandwidth term.

[Figure: transfer time vs. transfer size for H2D, D2H, and D2D copies, both axes on a log scale]
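For reference, a minimal sketch of how such a latency-versus-size sweep can be reproduced with CUDA events (this is not the actual bandwidthTest source; it assumes only the standard CUDA runtime API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t max_bytes = 1 << 26;  // sweep up to 64 MB
    unsigned char *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, max_bytes);  // pinned host buffer
    cudaMalloc((void **)&d, max_bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy so context creation does not skew the first sample.
    cudaMemcpy(d, h, 1, cudaMemcpyHostToDevice);

    // Sweep transfer sizes in powers of two, as bandwidthTest does.
    for (size_t n = 1; n <= max_bytes; n <<= 1) {
        cudaEventRecord(start);
        cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // time(n) ~ overhead + n / bandwidth; small n is overhead-bound
        printf("%10zu bytes: %10.3f us\n", n, ms * 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```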

Answered Sep 30 '22 by kangshiyin