Amount of local memory per CUDA thread

Tags: memory, cuda, limit, gpu-local-memory

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table 12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0).

I tried, without success, to verify this limit on Linux with CUDA 6.5.

Here is the code I used (its only purpose is to test the local memory limit; it does not perform any useful computation):

#include <stdlib.h>  // exit()
#include <stdio.h>

#define MEMSIZE 65000  // 65000 -> out of memory, 60000 -> ok

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if( abort )
            exit(code);
    }
}

inline void gpuCheckKernelExecutionError( const char *file, int line)
{
    gpuAssert( cudaPeekAtLastError(), file, line);
    gpuAssert( cudaDeviceSynchronize(), file, line);    
}


__global__ void kernel_test_private(char *output)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
    int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row

    char tmp[MEMSIZE];
    for( int i = 0; i < MEMSIZE; i++)
        tmp[i] = 4*r + c; // dummy computation in local mem
    for( int i = 0; i < MEMSIZE; i++)
        output[i] = tmp[i];
}

int main( void)
{
    printf( "MEMSIZE=%d bytes.\n", MEMSIZE);

    // allocate memory
    char output[MEMSIZE];
    char *gpuOutput;
    cudaMalloc( (void**) &gpuOutput, MEMSIZE);

    // run kernel
    dim3 dimBlock( 1, 1);
    dim3 dimGrid( 1, 1);
    kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
    gpuCheckKernelExecutionError( __FILE__, __LINE__);

    // transfer data from GPU memory to CPU memory
    cudaMemcpy( output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);

    // release resources
    cudaFree(gpuOutput);
    cudaDeviceReset();

    return 0;
}

And the compilation command line:

nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu

Compilation succeeds and reports:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info    : Function properties for _Z19kernel_test_privatePc
    65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 40 bytes cmem[0]

I get an "out of memory" error at runtime on the GTX 580 once I reach 65000 bytes per thread. Here is the exact console output of the program:

MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48

I also ran a test on a GTX 770 GPU (on Linux with CUDA 6.5). It ran without error for MEMSIZE=200000, but the "out of memory" error occurred at runtime for MEMSIZE=250000.

How can this behavior be explained? Am I doing something wrong?

asked Mar 02 '15 12:03

devel484

1 Answer

It seems you are running into a stack size limitation rather than a local memory limitation:

ptxas info    : Function properties for _Z19kernel_test_privatePc
    65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

In this case, the variable that you intended to be local is placed on the (GPU thread) stack.

Based on the information provided by @njuffa here, the available stack size limit is the lesser of:

The maximum local memory size (512 KB for cc2.x and higher)
GPU memory / (# of SMs) / (max threads per SM)

Clearly, the first limit is not the issue. I assume you have a "standard" GTX 580, which has 1.5 GB of memory and 16 SMs. A cc2.x device has a maximum of 1536 resident threads per multiprocessor. This gives 1536 MB/16/1536 = 1 MB/16 = 65536 bytes of stack per thread. Overhead and other memory usage reduce the total available memory, so the actual stack size limit is somewhat below 65536, apparently somewhere between 60000 and 65000 in your case.
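The arithmetic above can be written as a small helper. This is only a sketch of the rule of thumb quoted in this answer, not an official formula: the function name estimateStackLimitBytes and the constant kMaxLocalMemPerThread are made up for illustration, and the result ignores the overhead mentioned above. It is plain host-side C++, so it should compile with nvcc or with any C++ compiler.

```cpp
#include <algorithm>
#include <cstdint>

// Architectural local memory limit per thread: 512 KB for cc2.x and higher.
// (Illustrative constant, not part of the CUDA API.)
constexpr std::uint64_t kMaxLocalMemPerThread = 512ull * 1024;

// Rough upper bound on the per-thread stack size, taken as the lesser of
//   - the maximum local memory size per thread, and
//   - GPU memory / (# of SMs) / (max resident threads per SM).
// Hypothetical helper; real limits are lower because of overhead.
std::uint64_t estimateStackLimitBytes(std::uint64_t gpuMemBytes,
                                      std::uint64_t smCount,
                                      std::uint64_t maxThreadsPerSm)
{
    const std::uint64_t perThread = gpuMemBytes / smCount / maxThreadsPerSm;
    return std::min(kMaxLocalMemPerThread, perThread);
}
```

With the GTX 580 figures assumed above, estimateStackLimitBytes(1536ull * 1024 * 1024, 16, 1536) evaluates to 65536, which matches the hand calculation.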

I suspect a similar calculation for your GTX 770 would yield a similar result, i.e. a maximum stack size between 200000 and 250000.
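To repeat the estimate on whatever GPU is installed, the inputs can be read from the device at runtime. The following is a minimal sketch, assuming the GPU of interest is device 0; it prints the rough bound GPU memory / (# of SMs) / (max threads per SM) next to the stack size limit currently configured for the device. The estimate is only the rule of thumb from this answer, so treat the printed number as an approximation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int device = 0;  // assumption: the GPU under test is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Rough bound from the answer: GPU memory / (# of SMs) / (max threads per SM).
    const size_t estimate = prop.totalGlobalMem
                          / (size_t)prop.multiProcessorCount
                          / (size_t)prop.maxThreadsPerMultiProcessor;

    // Stack size per thread currently configured for this device.
    size_t stackSize = 0;
    err = cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s: %d SMs, %d max threads per SM, %zu bytes of global memory\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, prop.totalGlobalMem);
    printf("Estimated per-thread stack bound: %zu bytes\n", estimate);
    printf("Current cudaLimitStackSize:       %zu bytes\n", stackSize);
    return 0;
}
```

Comparing the printed estimate with the largest MEMSIZE that still runs gives a quick check of whether the rule of thumb holds on a given card.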

answered Sep 16 '22 20:09

Robert Crovella
