For example... Here's what I see in NVIDIA's docs:
```cuda
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();
```
Let's say this is wrapped in a function...
```cuda
void consume() {
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    kernel<<<grid, block>>>(a_d);
}
```
What if I also have a function:

```cuda
void produce() {
    // do stuff
    a_h[0] = 1;
    a_h[1] = 3;
    a_h[2] = 5;
    //...
}
```
If I call:
```cuda
produce();
consume();
produce(); // problem??
```
The second `produce()` call will start overwriting the host memory at `a_h`. How do I know that CUDA isn't still reading that host memory as part of the asynchronous copy? How can I safely write to `a_h` without disrupting the asynchronous memcpy?
EDIT: I know I can call `cudaDeviceSynchronize()` or `cudaStreamSynchronize()`, but those will also wait for `kernel` to complete. I would prefer not to wait until `kernel` is done: I want to start writing to the host memory `a_h` as soon as possible, without waiting for `kernel` to finish.
Pinned memory speeds up CPU-to-GPU copies (as triggered by e.g. `tensor.cuda()` in PyTorch) because page-locked memory cannot be paged out by the operating system. Pageable host memory must first be staged through a pinned buffer before it can be transferred to the GPU, i.e. it is effectively copied twice; pinned memory avoids that extra copy, and it is also required for `cudaMemcpyAsync` to be truly asynchronous with respect to the host.
`cudaMallocHost` allocates page-locked memory on the host.
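A minimal sketch of allocating pinned host memory with `cudaMallocHost` and using it for an asynchronous copy (the buffer size and names here are illustrative, not from the question):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    const size_t size = N * sizeof(float);

    float *a_h = nullptr, *a_d = nullptr;
    cudaMallocHost(&a_h, size);   // page-locked (pinned) host allocation
    cudaMalloc(&a_d, size);       // device allocation

    for (size_t i = 0; i < N; ++i) a_h[i] = (float)i;

    // This copy can only overlap with host work because a_h is pinned;
    // with pageable memory the runtime may fall back to a blocking copy.
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    cudaDeviceSynchronize();

    cudaFree(a_d);
    cudaFreeHost(a_h);           // pinned memory is freed with cudaFreeHost
    return 0;
}
```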
`cudaDeviceSynchronize()` also returns any error code that has occurred in any of those kernels. Note that when a thread calls `cudaDeviceSynchronize()`, it is not aware of which kernel launches have already been executed by other threads in the block.
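As a sketch of that error-reporting behavior (assuming some kernels have already been launched before this point):

```cuda
// cudaDeviceSynchronize() blocks until all preceding work is done and
// surfaces any error raised by kernels launched before this call.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
}
```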
A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
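A minimal sketch of issuing work into a non-default stream (`stream`, `a_d`, `a_h`, `grid`, and `block` are assumed to exist as in the question's snippet):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Both operations go into the same stream, so they execute in order.
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(a_d);

// Work issued into other streams may interleave or run concurrently
// with the two operations above.
cudaStreamSynchronize(stream);  // waits for just this stream
cudaStreamDestroy(stream);
```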
If you use a stream for the `cudaMemcpyAsync` call, you can record an event into the stream right after the asynchronous transfer and then use `cudaEventSynchronize` to wait on that event. This guarantees that the copy has finished, but doesn't rely on the device being idle or the stream being empty.
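A sketch of that approach applied to the `produce()`/`consume()` pattern from the question (the event name, helper function, and parameter list are assumptions for illustration; the stream and event must have been created with `cudaStreamCreate`/`cudaEventCreate`):

```cuda
__global__ void kernel(float *a_d);  // as in the question

cudaStream_t stream;   // created once with cudaStreamCreate(&stream)
cudaEvent_t copyDone;  // created once with cudaEventCreate(&copyDone)

void consume(float *a_d, const float *a_h, size_t size,
             dim3 grid, dim3 block) {
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copyDone, stream);  // marks the point after the copy
    kernel<<<grid, block, 0, stream>>>(a_d);
}

void wait_for_copy() {
    // Returns as soon as the memcpy has completed; the kernel launched
    // after the event may still be running on the device.
    cudaEventSynchronize(copyDone);
}
```

With this, the calling sequence becomes `produce(); consume(...); wait_for_copy(); produce();` — the second `produce()` can safely overwrite `a_h` while `kernel` is still executing, because the kernel reads only the device copy `a_d`.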