How CUDA constant memory allocation works?

Tags:

I'd like to get some insight about how constant memory is allocated (using CUDA 4.2). I know that the total available constant memory is 64KB. But when is this memory actually allocated on the device? Is this limit apply to each kernel, cuda context or for the whole application?

Let's say there are several kernels in a .cu file, each using less than 64K constant memory. But the total constant memory usage is more than 64K. Is it possible to call these kernels sequentially? What happens if they are called concurrently using different streams?

What happens if there is a large CUDA dynamic library with lots of kernels each using different amounts of constant memory?

What happens if there are two applications each requiring more than half of the available constant memory? The first application runs fine, but when will the second app fail? At app start, at cudaMemcpyToSymbol() calls or at kernel execution?

236

asked Jan 23 '13 15:01

hthms

1 Answers

Parallel Thread Execution ISA Version 3.1 section 5.1.3 discusses constant banks.

Constant memory is restricted in size, currently limited to 64KB which can be used to hold statically-sized constant variables. There is an additional 640KB of constant memory, organized as ten independent 64KB regions. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters. Since the ten regions are not contiguous, the driver must ensure that constant buffers are allocated so that each buffer fits entirely within a 64KB region and does not span a region boundary.

A simple program can be used to illustrate the use of constant memory.

__constant__ int    kd_p1;
__constant__ short  kd_p2;
__constant__ char   kd_p3;
__constant__ double kd_p4;

__constant__ float kd_floats[8];

__global__ void parameters(int p1, short p2, char p3, double p4, int* pp1, short* pp2, char* pp3,     double* pp4)
{
    *pp1 = p1;
    *pp2 = p2;
    *pp3 = p3;
    *pp4 = p4;
    return;
}

__global__ void constants(int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = kd_p1;
    *pp2 = kd_p2;
    *pp3 = kd_p3;
    *pp4 = kd_p4;
    return;
}

Compile this for compute_30, sm_30 and execute cuobjdump -sass <executable or obj> to disassemble you should see

Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = cuda
host = windows
compile_size = 32bit
identifier = c:/dev/constant_banks/kernel.cu

    code for sm_30
            Function : _Z10parametersiscdPiPsPcPd
    /*0008*/     /*0x10005de428004001*/     MOV R1, c [0x0] [0x44];        // stack pointer
    /*0010*/     /*0x40001de428004005*/     MOV R0, c [0x0] [0x150];       // pp1
    /*0018*/     /*0x50009de428004005*/     MOV R2, c [0x0] [0x154];       // pp2
    /*0020*/     /*0x0001dde428004005*/     MOV R7, c [0x0] [0x140];       // p1
    /*0028*/     /*0x13f0dc4614000005*/     LDC.U16 R3, c [0x0] [0x144];   // p2
    /*0030*/     /*0x60011de428004005*/     MOV R4, c [0x0] [0x158];       // pp3
    /*0038*/     /*0x70019de428004005*/     MOV R6, c [0x0] [0x15c];       // pp4
    /*0048*/     /*0x20021de428004005*/     MOV R8, c [0x0] [0x148];       // p4
    /*0050*/     /*0x30025de428004005*/     MOV R9, c [0x0] [0x14c];       // p4
    /*0058*/     /*0x1bf15c0614000005*/     LDC.U8 R5, c [0x0] [0x146];    // p3
    /*0060*/     /*0x0001dc8590000000*/     ST [R0], R7;                   // *pp1 = p1
    /*0068*/     /*0x0020dc4590000000*/     ST.U16 [R2], R3;               // *pp2 = p2
    /*0070*/     /*0x00415c0590000000*/     ST.U8 [R4], R5;                // *pp3 = p3
    /*0078*/     /*0x00621ca590000000*/     ST.64 [R6], R8;                // *pp4 = p4
    /*0088*/     /*0x00001de780000000*/     EXIT;
    /*0090*/     /*0xe0001de74003ffff*/     BRA 0x90;
    /*0098*/     /*0x00001de440000000*/     NOP CC.T;
    /*00a0*/     /*0x00001de440000000*/     NOP CC.T;
    /*00a8*/     /*0x00001de440000000*/     NOP CC.T;
    /*00b0*/     /*0x00001de440000000*/     NOP CC.T;
    /*00b8*/     /*0x00001de440000000*/     NOP CC.T;
            ...........................................


            Function : _Z9constantsPiPsPcPd
    /*0008*/     /*0x10005de428004001*/     MOV R1, c [0x0] [0x44];       // stack pointer
    /*0010*/     /*0x00001de428004005*/     MOV R0, c [0x0] [0x140];      // p1
    /*0018*/     /*0x10009de428004005*/     MOV R2, c [0x0] [0x144];      // p2
    /*0020*/     /*0x0001dde428004c00*/     MOV R7, c [0x3] [0x0];        // kd_p1
    /*0028*/     /*0x13f0dc4614000c00*/     LDC.U16 R3, c [0x3] [0x4];    // kd_p2
    /*0030*/     /*0x20011de428004005*/     MOV R4, c [0x0] [0x148];      // p3
    /*0038*/     /*0x30019de428004005*/     MOV R6, c [0x0] [0x14c];      // p4
    /*0048*/     /*0x20021de428004c00*/     MOV R8, c [0x3] [0x8];        // kd_p4
    /*0050*/     /*0x30025de428004c00*/     MOV R9, c [0x3] [0xc];        // kd_p4
    /*0058*/     /*0x1bf15c0614000c00*/     LDC.U8 R5, c [0x3] [0x6];     // kd_p3
    /*0060*/     /*0x0001dc8590000000*/     ST [R0], R7;
    /*0068*/     /*0x0020dc4590000000*/     ST.U16 [R2], R3;
    /*0070*/     /*0x00415c0590000000*/     ST.U8 [R4], R5;
    /*0078*/     /*0x00621ca590000000*/     ST.64 [R6], R8;
    /*0088*/     /*0x00001de780000000*/     EXIT;
    /*0090*/     /*0xe0001de74003ffff*/     BRA 0x90;
    /*0098*/     /*0x00001de440000000*/     NOP CC.T;
    /*00a0*/     /*0x00001de440000000*/     NOP CC.T;
    /*00a8*/     /*0x00001de440000000*/     NOP CC.T;
    /*00b0*/     /*0x00001de440000000*/     NOP CC.T;
    /*00b8*/     /*0x00001de440000000*/     NOP CC.T;
            .....................................

I annotated to the right of the SASS.

On sm30 you can see that parameters are passed in constant bank 0 starting at offset 0x140.

User defined __constant__ variables are defined in constant bank 3.

If you execute cuobjdump --dump-elf <executable or obj> you can find other interesting constant information.

32bit elf: abi=6, sm=30, flags = 0x1e011e
Sections:
Index Offset   Size ES Align   Type   Flags Link     Info Name
    1     34    142  0  1    STRTAB       0    0        0 .shstrtab
    2    176    19b  0  1    STRTAB       0    0        0 .strtab
    3    314     d0 10  4    SYMTAB       0    2        a .symtab
    4    3e4     50  0  4 CUDA_INFO       0    3        b .nv.info._Z9constantsPiPsPcPd
    5    434     30  0  4 CUDA_INFO       0    3        0 .nv.info
    6    464     90  0  4 CUDA_INFO       0    3        a .nv.info._Z10parametersiscdPiPsPcPd
    7    4f4    160  0  4  PROGBITS       2    0        a .nv.constant0._Z10parametersiscdPiPsPcPd
    8    654    150  0  4  PROGBITS       2    0        b .nv.constant0._Z9constantsPiPsPcPd
    9    7a8     30  0  8  PROGBITS       2    0        0 .nv.constant3
    a    7d8     c0  0  4  PROGBITS       6    3  a00000b .text._Z10parametersiscdPiPsPcPd
    b    898     c0  0  4  PROGBITS       6    3  a00000c .text._Z9constantsPiPsPcPd

.section .strtab

.section .shstrtab

.section .symtab
 index     value     size      info    other  shndx    name
   0          0        0        0        0      0     (null)
   1          0        0        3        0      a     .text._Z10parametersiscdPiPsPcPd
   2          0        0        3        0      7     .nv.constant0._Z10parametersiscdPiPsPcPd
   3          0        0        3        0      b     .text._Z9constantsPiPsPcPd
   4          0        0        3        0      8     .nv.constant0._Z9constantsPiPsPcPd
   5          0        0        3        0      9     .nv.constant3
   6          0        4        1        0      9     kd_p1
   7          4        2        1        0      9     kd_p2
   8          6        1        1        0      9     kd_p3
   9          8        8        1        0      9     kd_p4
  10         16       32        1        0      9     kd_floats
  11          0      192       12       10      a     _Z10parametersiscdPiPsPcPd
  12          0      192       12       10      b     _Z9constantsPiPsPcPd

The kernel parameter constant bank is versioned per launch so that concurrent kernels can be executed. The compiler and user constants are per CUmodule. It is the responsibility of the developer to manage coherency of this data. For example, the developer has to ensure that a cudaMemcpyToSymbol is update in a safe manner.

177

answered Oct 01 '22 14:10

Greg Smith

Related questions
                            
                                InputMethodManager holds reference to the tabhost - Memory Leak - OOM Error
                            
                                Optimistic reads and Locking STM (Software Transactional Memory) with C/C++
                            
                                Tools to analyzing the memory footprint of native DLLs and assemblies loaded in a process?
                            
                                Multiple sites on Django
                            
                                NodeJS in-memory cache with memory pressure awareness
                            
                                How to calculate memory usage as Task Manager does?
                            
                                What does "Private Data" define in VMMAP?
                            
                                Postgres gets out of memory errors despite having plenty of free memory
                            
                                Total memory used by Java process and heap size
                            
                                How to maximize DDR3 memory data transfer rate?
                            
                                Where are static class variables stored in memory?
                            
                                Practical limitations of JVM memory and CPU usage?
                            
                                zlib, deflate: How much memory to allocate?
                            
                                does jQuery .html() method leak memory?
                            
                                Why do heap memory usage and number of loaded classes keep increasing?
                            
                                Tomcat process killed by Linux kernel after running out of swap space; don't get any JVM OutOfMemory error
                            
                                Help understanding Windows memory - "Working Set"
                            
                                How big is the stack memory for a certain program, and are there any compiler flags that can set it?
                            
                                Anonymous mmap zero-filled?
                            
                                Python hangs indefinitely trying to delete deeply recursive object

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How CUDA constant memory allocation works?

Tags:

memory

constants

cuda

nvidia

gpu-constant-memory

hthms

People also ask

1 Answers

Greg Smith

Recent Activity

Donate For Us