I have a Tesla C2070 that is supposed to have 5636554752 bytes of memory.
However, this gives me an error:
int *buf_d = NULL;
err = cudaMalloc((void **)&buf_d, 1000000000 * sizeof(int));
if (err != cudaSuccess)
{
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return EXIT_ERROR;
}
How is this possible? With 4-byte ints, that request is 4,000,000,000 bytes, which is less than the card's reported total. Does this have something to do with the maximum memory pitch? Here are the GPU's specs:
Device 0: "Tesla C2070"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5636554752 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
As for the machine I'm running on, it has 24 Intel® Xeon® X565 processors and runs the Rocks 5.4 (Maverick) Linux distribution.
Any ideas? Thanks!
The basic problem is in your question title: you don't actually know that you have sufficient memory, you are assuming you do. The runtime API includes the cudaMemGetInfo function, which returns how much free memory there is on the device. When a context is established on a device, the driver must reserve space for device code, local memory for each thread, FIFO buffers for printf support, stack for each thread, and heap for in-kernel malloc/new calls (see this answer for further details). All of this can consume rather a lot of memory, leaving you with much less than the maximum available memory (after ECC reservations) that you are assuming your code can use. The API also includes cudaDeviceGetLimit, which you can use to query the amounts of memory that the on-device runtime support is consuming, and a companion call cudaDeviceSetLimit, which lets you change the amount of memory each component of the runtime support will reserve.
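For illustration, here is a minimal sketch (my addition, not part of the original answer) of querying the free/total memory and the three runtime reservations, assuming CUDA 4.0 or later where the cudaDevice* variants of these calls exist:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);        // free/total device memory in bytes
    printf("free: %zu bytes, total: %zu bytes\n", free_b, total_b);

    size_t stack = 0, fifo = 0, heap = 0;
    cudaDeviceGetLimit(&stack, cudaLimitStackSize);       // per-thread stack
    cudaDeviceGetLimit(&fifo,  cudaLimitPrintfFifoSize);  // printf FIFO buffer
    cudaDeviceGetLimit(&heap,  cudaLimitMallocHeapSize);  // in-kernel malloc/new heap
    printf("stack/thread: %zu, printf FIFO: %zu, malloc heap: %zu\n",
           stack, fifo, heap);

    // Shrink the in-kernel malloc heap if your kernels never call malloc.
    // The 1 MB figure here is an arbitrary illustration, not a recommendation.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1 << 20);
    return 0;
}

Shrinking cudaLimitMallocHeapSize or cudaLimitPrintfFifoSize only makes sense if your kernels never use the corresponding features.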
Even after you have tuned the runtime memory footprint to your tastes and have the actual free memory value from the driver, there are still page-size granularity and fragmentation considerations to contend with. It is rarely possible to allocate every byte of what the API reports as free. Usually, I would do something like this when the objective is to try and allocate every available byte on the card:
const size_t Mb = 1 << 20;  // assuming 1 MiB allocation granularity here
size_t available, total;
cudaMemGetInfo(&available, &total);
int *buf_d = 0;
size_t nwords = available / sizeof(int);  // start from the reported free memory
size_t words_per_Mb = Mb / sizeof(int);
while (cudaMalloc((void **)&buf_d, nwords * sizeof(int)) == cudaErrorMemoryAllocation)
{
    // back off 1 MiB at a time until an allocation succeeds
    nwords -= words_per_Mb;
    if (nwords < words_per_Mb)
    {
        // signal no free memory
        break;
    }
}
// leaves int buf_d[nwords] on the device, or signals no free memory
(Note: this has never been near a compiler and is only safe with CUDA 3.0 or later.) It is implicitly assumed that none of the obvious sources of problems with large allocations apply here: a 32-bit host operating system, a Windows WDDM platform without TCC mode enabled, or older known driver issues.
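As a usage sketch (again mine, not compiled) continuing from the loop above: the failed cudaMalloc calls leave a sticky error in the runtime, so it is worth clearing it with cudaGetLastError() before doing anything else, and freeing the block when finished:

if (nwords >= words_per_Mb)
{
    cudaGetLastError();  // clear the sticky error left by the failed attempts
    // ... launch kernels that use buf_d[0 .. nwords) here ...
    cudaFree(buf_d);     // release the block when done
}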