I will allocate some memory with cudaMallocManaged. Can I safely pass this pointer to another program module (which is compiled into another .o file) that is not aware of CUDA and will just use plain memcpy or something to manipulate the data behind the pointer?
Something like this:
// compiled into A.o
class A {
public:
    void* getMem(int size) {
        void* ptr = nullptr;
        cudaMallocManaged(&ptr, size); // takes the address of the pointer
        return ptr;
    }
    // some kernels here
};
// compiled into B.o (no CUDA awareness)
class B {
public:
    void manipulateMem(void* ptr, void* source, int size) {
        memcpy(ptr, source, size);
    }
};
And then code like this, maybe compiled into main.o:
A a;
B b;
void* mem = a.getMem(100);
b.manipulateMem(mem, source, 100);
I did not find any notice saying that automatic copying/synchronizing should not work in this case.
Can I safely pass this pointer to another program module (which is compiled into another .o file) that is not aware of CUDA and will just use plain memcpy or something to manipulate the data behind the pointer?
Yes, you can, but the Unified Memory access rules (documented in the CUDA programming guide) must still be adhered to. For GPUs with compute capability 3.0 or higher but less than 6.0, these rules currently are:
1. After a managed allocation (cudaMallocManaged), and before any kernels have been called, a managed pointer is accessible from (code running on) the host CPU.
2. After a kernel launch has been issued, and until a cudaDeviceSynchronize() has been called, the data is inaccessible from host code, and any attempt to use it in host code will result in UB, which may include seg faults.
3. After the device synchronize (cudaDeviceSynchronize()), host access to the data referenced by the managed pointer is restored. (Technically, in the current implementation, host access to the data at this point will normally result in page faults, an OS-visible event. These page faults basically call into the CUDA runtime, which then does the necessary cudaMemcpy operations under the hood to return the data to the host and service the page fault. But these parenthetical comments are not necessary for understanding the general rules of behavior.)
So the CUDA runtime has explicit markers (kernel launch -> synchronize) that identify to it how to manage the data (when to migrate it and in which direction). Therefore, even though you have code running in some "module ... that is not aware of CUDA", if the above rules are adhered to, the code in that module will have access to the data, because the CUDA runtime has adequate, explicit markers that tell it how to manage the data.
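A minimal sketch of that launch-then-synchronize ordering (the kernel name addOne and the sizes are illustrative, not taken from your code) might look like this:
// main.cu -- sketch of the pre-6.0 access ordering for managed memory
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 100;
    int *ptr = nullptr;
    cudaMallocManaged(&ptr, n * sizeof(int));

    // Host access is fine here: no kernel has been launched yet.
    // A CUDA-unaware module could do this memset/memcpy just as well.
    memset(ptr, 0, n * sizeof(int));

    addOne<<<1, 128>>>(ptr, n);
    // Between the launch above and the synchronize below,
    // host access to ptr is undefined behavior on pre-6.0 devices.

    cudaDeviceSynchronize();

    // Host access is restored; plain host code may read/write again.
    printf("%d\n", ptr[0]); // expect 1

    cudaFree(ptr);
    return 0;
}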
For GPUs of compute capability 6.0 or higher, the above rules essentially do not apply. For these GPUs, concurrent access by the host and device is possible, but concurrent access to a common resource still presents the possibility of race conditions, as in any multi-processor/multi-threaded environment. Currently, the CUDA runtime does not enforce any specific ordering between host and device accesses to the same page of memory.
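One way to check at run time which set of rules applies is to query the concurrentManagedAccess device attribute; a minimal sketch (device selection and messages are just placeholders):
// Sketch: query whether the device allows concurrent host/device
// access to managed memory (typically compute capability 6.0+,
// subject to OS/driver support).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    int concurrent = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess, device);
    if (concurrent) {
        printf("Concurrent host/device managed access supported.\n");
    } else {
        printf("Pre-6.0 rules apply: synchronize before host access.\n");
    }
    return 0;
}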
Yes, the CUDA unified memory model exists explicitly to allow this kind of access: not only can the GPU access unified memory pointers directly, but in the same way code running on the host CPU gets unified access to that memory, including data that currently resides on the GPU. Be aware of bus bandwidth bottlenecks, though.
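On devices that support concurrent managed access, one common way to mitigate those bandwidth/page-fault costs is to prefetch the managed allocation before bulk access; a sketch under that assumption (the helper name prefetchExample is hypothetical):
// Sketch: explicit prefetching of managed memory to reduce on-demand
// page-fault traffic over the bus (requires a device that reports
// concurrentManagedAccess, i.e. typically compute capability 6.0+).
#include <cuda_runtime.h>

void prefetchExample(float *managedPtr, size_t bytes, int device) {
    // Move the pages to the GPU before kernels touch them...
    cudaMemPrefetchAsync(managedPtr, bytes, device);

    // ... launch kernels that use managedPtr here ...

    // ...and bring them back to the host before CPU-side code
    // (CUDA-aware or not) reads the results.
    cudaMemPrefetchAsync(managedPtr, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();
}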