
How is CUDA memory managed?

Tags: cuda, gpu, nvidia

When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) While trying to understand this problem, I realized I have a couple of questions about CUDA memory management.

1. Is there a virtual memory concept in CUDA?
2. If only one kernel is allowed to run on CUDA at a time, is all of the memory it used or allocated released after it terminates? If not, when is this memory freed?
3. If more than one kernel is allowed to run on CUDA, how do they make sure the memory they use does not overlap?

Can anyone help me answer these questions? Thanks

Edit 1: Operating system: x86_64 GNU/Linux. CUDA version: 4.0. Device: GeForce 200. It is one of the GPUs attached to the machine, and I don't think it is a display device.

Edit 2: The following is what I got after doing some research. Feel free to correct me.

1. CUDA creates one context for each host thread. The context keeps track of information such as which portion of memory (pre-allocated or dynamically allocated) has been reserved for this application, so that other applications cannot write to it. When the application (not the kernel) terminates, this portion of memory is released.
2. CUDA memory is maintained as a linked list. When an application needs to allocate memory, the allocator walks this list to see whether a contiguous chunk of memory is available. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total available memory is greater than the requested size. That is the memory fragmentation problem.
3. cuMemGetInfo tells you how much memory is free, but not necessarily how much you can obtain in a single maximum-size allocation, because of memory fragmentation.
4. On the Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can each allocate almost the whole GPU memory, and WDDM manages swapping data back to main memory.
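
To make points 2 and 3 concrete, here is a toy model in plain host-side C++. It assumes free memory can be described as a list of free contiguous chunk sizes. `FreeList` and its member names are hypothetical, and this is only an illustration of fragmentation, not a description of how the CUDA driver actually allocates memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model only: free memory as a list of free contiguous chunk sizes (bytes).
// This is NOT the CUDA driver's allocator, just an illustration of fragmentation.
struct FreeList {
    std::vector<std::size_t> chunk_sizes;

    // Sum of all free chunks: the kind of figure a "free memory" query reports.
    std::size_t total_free() const {
        return std::accumulate(chunk_sizes.begin(), chunk_sizes.end(), std::size_t{0});
    }

    // Largest single contiguous chunk: the upper bound for one allocation.
    std::size_t largest_chunk() const {
        return chunk_sizes.empty()
                   ? 0
                   : *std::max_element(chunk_sizes.begin(), chunk_sizes.end());
    }

    // In this model a request succeeds only if one chunk is big enough on its own.
    bool can_allocate(std::size_t bytes) const { return bytes <= largest_chunk(); }
};
```

With three free chunks of 8 MB each, `total_free()` reports 24 MB, yet `can_allocate` rejects a 20 MB request because no single chunk is large enough. That is the gap between "memory free" and "largest possible allocation" described in point 3.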

New questions:

1. If the memory reserved in the context is fully released after the application terminates, memory fragmentation should not exist. There must be some kind of data left in the memory.
2. Is there any way to restructure the GPU memory?


asked Dec 30 '11 22:12

xhe8

2 Answers

The device memory available to your code at runtime is basically calculated as

Free memory = total memory
            - display driver reservations
            - CUDA driver reservations
            - CUDA context static allocations (local memory, constant memory, device code)
            - CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
            - CUDA context user allocations (global memory, textures)

If you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to allocate memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. A lot of things get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume, for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocessor on the device. This can run into hundreds of MB of memory if a local-memory-heavy kernel is loaded on a device with a lot of multiprocessors.
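
As a rough sketch of that last calculation, the reservation described above is the product of three numbers. The function name `reserved_local_memory_bytes` and the device figures below are illustrative assumptions, not values read from any particular GPU, and the driver may round or pad the real reservation.

```cpp
#include <cstdint>

// Rough estimate of the local memory a context reserves up front:
// bytes per thread x max resident threads per multiprocessor x multiprocessor count.
// All three inputs are assumptions supplied by the caller.
constexpr std::uint64_t reserved_local_memory_bytes(std::uint64_t local_bytes_per_thread,
                                                    std::uint64_t max_threads_per_sm,
                                                    std::uint64_t sm_count) {
    return local_bytes_per_thread * max_threads_per_sm * sm_count;
}

// Hypothetical device: 30 multiprocessors, 1024 resident threads each,
// and a kernel that needs 16 KiB of local memory per thread.
static_assert(reserved_local_memory_bytes(16 * 1024, 1024, 30) == 480ull * 1024 * 1024,
              "16 KiB x 1024 threads x 30 multiprocessors = 480 MiB");
```

Under those assumed figures a single local-memory-heavy kernel accounts for 480 MiB before any cudaMalloc call is made, which is how a program that only asks for 20 MB can still run out.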

The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call. The difference between the two readings gives you the amount of memory your context is using. That might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting failure on the first cudaMalloc call.
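
A minimal sketch of such a host-only program follows. It uses the runtime API calls `cudaMemGetInfo`, `cudaFree` and `cudaGetErrorString`; the helper name `report_memory` is made up for this example, and calling `cudaFree(0)` to force context creation is a common idiom rather than something stated above.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Print free and total device memory as seen by the current context.
// report_memory is a hypothetical helper name used only in this sketch.
static void report_memory(const char* label) {
    std::size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: cudaMemGetInfo failed: %s\n",
                     label, cudaGetErrorString(err));
        return;
    }
    std::printf("%s: %.1f MiB free of %.1f MiB\n", label,
                free_bytes / (1024.0 * 1024.0),
                total_bytes / (1024.0 * 1024.0));
}

int main() {
    // No device code in this program. cudaFree(0) is used here only to
    // make the runtime establish a context on the default device.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "context creation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    report_memory("minimal context");
    return 0;
}
```

Copying `report_memory` into the problematic program and calling it just before the first cudaMalloc gives the second reading to compare against this baseline.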

answered Oct 07 '22 18:10

talonmies

1. GPU off-chip memory is separated into global, local and constant memory. These three memory types are a virtual memory concept. Global memory is accessible to all threads, local memory is private to a single thread (mostly used for register spilling), and constant memory is cached global memory (writable only from host code). Have a look at section 5.3.2 of the CUDA C Programming Guide.
2. EDIT: removed
3. Memory allocated via cudaMalloc never overlaps. There should be enough memory available for whatever a kernel allocates at runtime. If you are out of memory and try to start a kernel, you should get the "unknown error" error message (this is only a guess on my part). In that case the driver was unable to start and/or execute the kernel.

answered Oct 07 '22 18:10

Michael Haidl
