I'm confused by some comments I've seen about blocking and cudaMemcpy. My understanding is that Fermi hardware can execute kernels and perform a cudaMemcpy simultaneously.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function blocks further CPU execution until the copy has fully completed? Or does it mean the copy won't start until the preceding kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy() blocks the CPU until the copy is complete, and the copy only begins once all preceding CUDA calls have completed. cudaMemcpyAsync() is asynchronous and does not block the CPU.
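To illustrate the asynchronous variant, here is a minimal sketch reusing the kernel name from the question (SomeCudaCall, h_data, and d_data are placeholders). Note that cudaMemcpyAsync only truly overlaps with CPU work when the host buffer is pinned, which is why cudaMallocHost is used here:

```cuda
int *h_data, *d_data;
cudaMallocHost(&h_data, sizeof(int));            // pinned host memory (required for real async copies)
cudaMalloc(&d_data, sizeof(int));

cudaStream_t stream;
cudaStreamCreate(&stream);

SomeCudaCall<<<25, 34, 0, stream>>>(d_data);     // kernel queued in the stream
cudaMemcpyAsync(h_data, d_data, sizeof(int),
                cudaMemcpyDeviceToHost, stream); // queued after the kernel, returns immediately
// The CPU is free to do other work here while the GPU runs.
cudaStreamSynchronize(stream);                   // wait for the kernel and the copy to finish

cudaFree(d_data);
cudaFreeHost(h_data);
cudaStreamDestroy(stream);
```

Within a single stream the copy still waits for the kernel; the difference from plain cudaMemcpy is that the CPU is not blocked while it waits.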
Most CUDA calls are synchronous (often called “blocking”). An example of a blocking call is cudaMemcpy().
malloc() allocates dynamic memory on the host, i.e. in CPU memory. To allocate global memory on the device you need to call cudaMalloc(). To operate on data with the GPU, your whole data set needs to be transferred to device global memory first. cudaMalloc() only allocates the memory; it does not copy your data to the device.
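A minimal sketch of that two-step pattern (allocate, then copy explicitly); the variable names are illustrative:

```cuda
int h_data[4] = {1, 2, 3, 4};                    // host (CPU) array
int *d_data;
cudaMalloc(&d_data, sizeof(h_data));             // allocates device memory only, contents undefined
cudaMemcpy(d_data, h_data, sizeof(h_data),
           cudaMemcpyHostToDevice);              // the copy has to be requested explicitly
// ... launch kernels that read d_data ...
cudaFree(d_data);
```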
The CUDA runtime makes it possible to compile and link your CUDA kernels into executables. This means that you don't have to distribute cubin files with your application, or deal with loading them through the driver API. As you have noted, it is generally easier to use.
Your examples are equivalent. If you want asynchronous execution, you can use streams (or separate contexts) together with cudaMemcpyAsync, so that kernel execution overlaps with the copy.
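A sketch of that overlap using two streams, reusing the kernel name from the question (d_a, h_b, d_b, and nbytes are assumed to be set up elsewhere, with h_b pinned):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

SomeCudaCall<<<25, 34, 0, s0>>>(d_a);            // compute in stream s0
cudaMemcpyAsync(d_b, h_b, nbytes,
                cudaMemcpyHostToDevice, s1);     // copy in stream s1, overlaps with the kernel

cudaDeviceSynchronize();                         // wait for both streams to drain

cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
```

Operations in different streams have no ordering relative to each other, which is what lets the hardware run the copy engine and the compute engine at the same time.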