CUDA Unified Memory Programming - Can I pass a pointer to modules which are not aware of CUDA?

Tags:

cuda

I will allocate some memory with cudaMallocManaged. Can I safely pass this pointer to another program module (which is compiled into a separate .o file) that is not aware of CUDA and will just use plain memcpy or something similar to manipulate the data behind the pointer?

Something like this:

// compiled into A.o
class A {
public:
  void* getMem(int size) {
    void* ptr = nullptr;
    cudaMallocManaged(&ptr, size);  // pass the address of the pointer
    return ptr;
  }

  // some kernels here

};


// compiled into B.o
class B {
public:
  void manipulateMem(void* ptr, void* source, int size) {
    memcpy(ptr, source, size);  // plain host-side copy, no CUDA involved
  }
};

And then code like this, perhaps compiled into main.o:

A a;
B b;
void* mem = a.getMem(100);
b.manipulateMem(mem, source, 100);

I have not found any notice that the automatic copying/synchronizing would not work in this case.

asked Dec 25 '22 by Michael

2 Answers

Can I safely pass this pointer to another program module (which is compiled into a separate .o file) that is not aware of CUDA and will just use plain memcpy or something similar to manipulate the data behind the pointer?

Yes, you can, but the Unified Memory access rules (documented in the CUDA programming guide) must still be adhered to. For GPUs with compute capability 3.0 or higher but less than 6.0, these rules currently are:

  1. Initially, upon allocation (using a managed allocator such as cudaMallocManaged), and before any kernels have been called, a managed pointer is accessible from (code running on) the host CPU.
  2. Once any kernel has been launched, and until a subsequent cudaDeviceSynchronize() has been called, the data is inaccessible from host code, and any attempt to use it in host code will result in undefined behavior, which may include segmentation faults.
  3. After a synchronization event (e.g. cudaDeviceSynchronize()), host access to the data referenced by the managed pointer is restored. (Technically, in the current implementation, host access to the data at this point will normally result in page faults, an OS-visible event. These page faults basically call into the CUDA runtime, which then does the necessary cudaMemcpy operations under the hood to return the data to the host and service the page fault. But these parenthetical comments are not necessary for understanding the general rules of behavior.)

So the CUDA runtime has explicit markers (kernel launch -> synchronize) that identify to it how to manage the data (when to migrate, and in which direction). Therefore, even though you have code running in some "module ... that is not aware of CUDA", if the above rules are adhered to, the code in that module will have access to the data, because the CUDA runtime has adequate, explicit markers telling it how to manage the data.
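
As an illustration of that launch -> synchronize -> host-access pattern, here is a minimal sketch; the kernel name fillKernel and the CUDA-unaware memcpy consumer are assumptions made for the example, not code from the question:

#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

__global__ void fillKernel(char* data, int size) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) data[i] = 42;               // device writes the managed buffer
}

// "CUDA-unaware" consumer: just an ordinary host-side memcpy
void manipulateMem(void* dst, const void* src, int size) {
  memcpy(dst, src, size);
}

int main() {
  const int size = 100;
  char* mem = nullptr;
  char* host_src = new char[size]();        // zero-initialized host source buffer

  cudaMallocManaged((void**)&mem, size);    // rule 1: host may touch mem now
  fillKernel<<<1, 128>>>(mem, size);        // rule 2: after the launch, host access is UB
  cudaDeviceSynchronize();                  // rule 3: host access is restored

  manipulateMem(mem, host_src, size);       // plain memcpy on managed memory is now safe
  printf("%d\n", mem[0]);

  cudaFree(mem);
  delete[] host_src;
  return 0;
}

On a pre-Pascal GPU, removing the cudaDeviceSynchronize() call would make the memcpy inside manipulateMem undefined behavior per rule 2 above, even though that function knows nothing about CUDA.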

For GPUs of compute capability 6.0 or higher, the above rules essentially do not apply. For these GPUs, concurrent access by the host and device is possible, but multiple access to a common resource still presents the possibility for race conditions as in any multi-processor/multi-threaded environment. Currently, the CUDA runtime does not enforce any specific access ordering rules between host and device access to the same page of memory.
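
If you need to know at runtime which set of rules applies to a given GPU, a small query of the concurrentManagedAccess device attribute will tell you; this is a sketch assuming a single default device:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int dev = 0, concurrent = 0;
  cudaGetDevice(&dev);
  // 1 on devices (and operating systems) that support concurrent managed access
  // (compute capability 6.0+); 0 means the stricter pre-Pascal rules above apply.
  cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
  printf("concurrentManagedAccess = %d\n", concurrent);
  return 0;
}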

answered Feb 09 '23 by Robert Crovella


Yes, the CUDA unified memory model exists explicitly to allow this kind of access: not only can the GPU access unified memory pointers directly, but in the same way the part of the program running on the host CPU has unified access to that memory as well, including memory currently resident on the GPU. Be aware of bus bandwidth bottlenecks, though.
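
One way to reduce the page-fault and bandwidth cost of such host-side accesses, on devices that support concurrent managed access (compute capability 6.0+), is to prefetch the managed buffer to the CPU before the CUDA-unaware code touches it. This is only a sketch; the buffer size and the elided kernel work are assumptions:

#include <cuda_runtime.h>
#include <cstring>

int main() {
  const size_t size = 1 << 20;
  char* mem = nullptr;
  cudaMallocManaged((void**)&mem, size);

  // ... kernels that produce data in mem, followed by cudaDeviceSynchronize() ...

  cudaMemPrefetchAsync(mem, size, cudaCpuDeviceId, 0);  // migrate the pages to host memory
  cudaStreamSynchronize(0);                             // wait for the migration to finish

  char* host_copy = new char[size];
  memcpy(host_copy, mem, size);                         // plain memcpy, now without per-page faults

  delete[] host_copy;
  cudaFree(mem);
  return 0;
}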

answered Feb 09 '23 by datenwolf