 

Why is cudaMalloc giving me an error when I know there is sufficient memory space?

Tags: memory, cuda

I have a Tesla C2070 that is supposed to have 5636554752 bytes of memory.

However, this gives me an error:

int *buf_d = NULL;

// Requests 1000000000 * sizeof(int) = 4,000,000,000 bytes (~3.7 GiB)
cudaError_t err = cudaMalloc((void **)&buf_d, 1000000000 * sizeof(int));

if (err != cudaSuccess)
{
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return EXIT_ERROR;
}

How is this possible? Does this have something to do with the maximum memory pitch? Here are the GPU's specs:

Device 0: "Tesla C2070" 
CUDA Driver Version:    3.20 
CUDA Runtime Version:   3.20 
CUDA Capability Major/Minor version number: 2.0 
Total amount of global memory:  5636554752 bytes 
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores) 
Total amount of constant memory:    65536 bytes 
Total amount of shared memory per block:    49152 bytes 
Total number of registers available per block: 32768 
Warp size: 32 
Maximum number of threads per block:    1024 
Maximum sizes of each dimension of a block: 1024 x 1024 x 64 
Maximum sizes of each dimension of a grid:  65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes

As for the machine I'm running on, it has 24 Intel® Xeon® X565 processors and runs the Rocks 5.4 (Maverick) Linux distribution.

Any ideas? Thanks!

asked Jan 18 '12 by Michael Eilers Smith

1 Answer

The basic problem is in your question title: you don't actually know that you have sufficient memory, you are assuming you do. The runtime API includes the cudaMemGetInfo function, which returns how much free memory there is on the device. When a context is established on a device, the driver must reserve space for device code, local memory for each thread, FIFO buffers for printf support, stack for each thread, and heap for in-kernel malloc/new calls (see this answer for further details). All of this can consume rather a lot of memory, leaving you with much less than the maximum available memory (after ECC reservations) that you are assuming your code can use.

The API also includes cudaDeviceGetLimit, which you can use to query the amount of memory the on-device runtime support is consuming, and a companion call, cudaDeviceSetLimit, which lets you change how much memory each component of the runtime support will reserve.
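As a minimal sketch of those queries (assuming a CUDA 4.0 or later runtime, where the cudaDeviceGetLimit/cudaDeviceSetLimit entry points exist), it might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // How much device memory is actually free right now?
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("free: %zu of %zu bytes\n", free_bytes, total_bytes);

    // How much is the on-device runtime support reserving?
    size_t stack = 0, fifo = 0, heap = 0;
    cudaDeviceGetLimit(&stack, cudaLimitStackSize);      // per-thread stack
    cudaDeviceGetLimit(&fifo,  cudaLimitPrintfFifoSize); // printf FIFO buffer
    cudaDeviceGetLimit(&heap,  cudaLimitMallocHeapSize); // in-kernel malloc/new heap
    printf("stack/thread: %zu, printf FIFO: %zu, malloc heap: %zu bytes\n",
           stack, fifo, heap);

    // If your kernels never call malloc/new, the heap reservation can be
    // shrunk (the 1 MiB value here is only illustrative)
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1 << 20);

    return 0;
}

Note that cudaDeviceSetLimit must be called before the limit is consumed (for the malloc heap, before launching any kernel that calls malloc or free), otherwise it has no effect.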

Even after you have tuned the runtime memory footprint to your taste and obtained the actual free memory value from the driver, there are still page-size granularity and fragmentation considerations to contend with: it is rarely possible to allocate every byte of what the API reports as free. Usually, I would do something like this when the objective is to try and allocate every available byte on the card:

const size_t Mb = 1 << 20; // assuming a 1 MiB allocation granularity here

size_t available, total;
cudaMemGetInfo(&available, &total);

int *buf_d = 0;
size_t nwords = total / sizeof(int);
size_t words_per_Mb = Mb / sizeof(int);

// Retry, shrinking the request by 1 MiB each time, until either the
// allocation succeeds or the request falls below 1 MiB
while (cudaMalloc((void **)&buf_d, nwords * sizeof(int)) == cudaErrorMemoryAllocation)
{
    nwords -= words_per_Mb;
    if (nwords < words_per_Mb)
    {
        // signal no free memory
        break;
    }
}

// leaves int buf_d[nwords] on the device or signals no free memory

(Note: this code has never been near a compiler and is only safe with CUDA 3 or later.) It is implicitly assumed that none of the obvious sources of problems with big allocations apply here: a 32-bit host operating system, a WDDM Windows platform without TCC mode enabled, or known issues in older drivers.
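For completeness, a quick sanity check for two of those cases might look like the sketch below. It relies on the tccDriver field of cudaDeviceProp, which reports the driver model in use on Windows; the check itself is illustrative rather than part of the allocation recipe above.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // A 32-bit host process caps the addressable space at 4 GiB,
    // no matter how much memory the GPU has
    if (sizeof(void *) == 4)
        printf("warning: 32-bit host process, large allocations will fail\n");

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // On Windows, tccDriver == 0 means the WDDM driver model, under which
    // the OS limits how much device memory a single allocation can claim
    printf("device 0: %s, TCC driver: %d\n", prop.name, prop.tccDriver);

    return 0;
}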

answered Sep 22 '22 by talonmies