I have a somewhat complex kernel with the following stats:
ptxas info : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info : Function properties for my_kernel
32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
It's not clear to me which part of the kernel is the "high water mark" in terms of register usage. The nature of the kernel is such that stubbing out various parts for constant values causes the optimizer to constant-fold later parts, etc. (at least that's how it seems, since the numbers I get back when I do so don't make much sense).
The CUDA profiler is similarly unhelpful AFAICT, simply telling me that I have register pressure.
Is there a way to get more information about register usage? I'd prefer a tool of some kind, but I'd also be interested in hearing about digging into the compiled binary directly, if that's what it takes.
Edit: It is certainly possible for me to approach this bottom-up (i.e. making experimental code changes, checking the impact on register usage, etc.), but I'd rather start top-down, or at least get some guidance on where to begin a bottom-up investigation.
You can get a feel for the complexity of the compiler output by compiling to annotated PTX like this:
nvcc -ptx -Xopencc="-LIST:source=on" branching.cu
which will emit a PTX assembly file with the original C code interleaved as comments:
.entry _Z11branchTest0PfS_S_ (
        .param .u64 __cudaparm__Z11branchTest0PfS_S__a,
        .param .u64 __cudaparm__Z11branchTest0PfS_S__b,
        .param .u64 __cudaparm__Z11branchTest0PfS_S__d)
{
    .reg .u16 %rh<4>;
    .reg .u32 %r<5>;
    .reg .u64 %rd<10>;
    .reg .f32 %f<5>;
    .loc    16  1   0
//  1  __global__ void branchTest0(float *a, float *b, float *d)
$LDWbegin__Z11branchTest0PfS_S_:
    .loc    16  7   0
//  3      unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
//  4      float aval = a[tidx], bval = b[tidx];
//  5      float z0 = (aval > bval) ? aval : bval;
//  6
//  7      d[tidx] = z0;
    mov.u16         %rh1, %ctaid.x;
    mov.u16         %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.u32.u16     %r2, %tid.x;
    add.u32         %r3, %r2, %r1;
    cvt.u64.u32     %rd1, %r3;
    mul.wide.u32    %rd2, %r3, 4;
    ld.param.u64    %rd3, [__cudaparm__Z11branchTest0PfS_S__a];
    add.u64         %rd4, %rd3, %rd2;
    ld.global.f32   %f1, [%rd4+0];
    ld.param.u64    %rd5, [__cudaparm__Z11branchTest0PfS_S__b];
    add.u64         %rd6, %rd5, %rd2;
    ld.global.f32   %f2, [%rd6+0];
    max.f32         %f3, %f1, %f2;
    ld.param.u64    %rd7, [__cudaparm__Z11branchTest0PfS_S__d];
    add.u64         %rd8, %rd7, %rd2;
    st.global.f32   [%rd8+0], %f3;
    .loc    16  8   0
//  8  }
    exit;
$LDWend__Z11branchTest0PfS_S_:
} // _Z11branchTest0PfS_S_
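For reference, the kernel that produced this listing can be read straight out of the interleaved comments. Reconstructed here (the function signature and body are taken verbatim from the PTX comments; only the braces and layout are inferred):

```cuda
// branching.cu -- reconstructed from the source comments embedded in the PTX above
__global__ void branchTest0(float *a, float *b, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float aval = a[tidx], bval = b[tidx];
    float z0 = (aval > bval) ? aval : bval;

    d[tidx] = z0;
}
```

Note how the ternary conditional compiles to a single `max.f32` in the PTX (and `FMNMX` in the machine code further down) rather than a branch.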
Note that this doesn't directly tell you anything about the register usage, because PTX uses static single assignment (SSA) form, but it shows you what the assembler is given as input and how it relates to your original code. With the CUDA 4.0 toolkit, you can then compile the C to a cubin file for the Fermi architecture:
$ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
ptxas info : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
ptxas info : Function properties for _Z11branchTest1PfS_S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
and use the cuobjdump utility to disassemble the machine code the assembler produces:
$ cuobjdump -sass branching.cubin
code for sm_20
    Function : _Z11branchTest0PfS_S_
    /*0000*/  /*0x00005de428004404*/  MOV R1, c [0x1] [0x100];
    /*0008*/  /*0x94001c042c000000*/  S2R R0, SR_CTAid_X;
    /*0010*/  /*0x84009c042c000000*/  S2R R2, SR_Tid_X;
    /*0018*/  /*0x10015de218000000*/  MOV32I R5, 0x4;
    /*0020*/  /*0x2000dc0320044000*/  IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
    /*0028*/  /*0x10311c435000c000*/  IMUL.U32.U32.HI R4, R3, 0x4;
    /*0030*/  /*0x80319c03200b8000*/  IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
    /*0038*/  /*0x9041dc4348004000*/  IADD.X R7, R4, c [0x0] [0x24];
    /*0040*/  /*0xa0321c03200b8000*/  IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
    /*0048*/  /*0x00609c8584000000*/  LD.E R2, [R6];
    /*0050*/  /*0xb0425c4348004000*/  IADD.X R9, R4, c [0x0] [0x2c];
    /*0058*/  /*0xc0329c03200b8000*/  IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
    /*0060*/  /*0x00801c8584000000*/  LD.E R0, [R8];
    /*0068*/  /*0xd042dc4348004000*/  IADD.X R11, R4, c [0x0] [0x34];
    /*0070*/  /*0x00201c00081e0000*/  FMNMX R0, R2, R0, !pt;
    /*0078*/  /*0x00a01c8594000000*/  ST.E [R10], R0;
    /*0080*/  /*0x00001de780000000*/  EXIT;
    ......................................
It is usually possible to trace back from the assembler output to the PTX and get at least a rough idea of where the "greedy" code sections are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA documents its ELF format for device code, I reckon a proper code-analysis tool would be a great project for someone.
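While tracking down the offending code, register usage can at least be capped rather than merely observed: the `-maxrregcount` nvcc flag limits registers for all kernels, and the `__launch_bounds__` qualifier does it per kernel, at the cost of extra spills. A minimal sketch (the bound values here are illustrative, not recommendations; the kernel body is a placeholder):

```cuda
// Hint to ptxas: this kernel will be launched with at most 256 threads
// per block, and we want at least 4 blocks resident per multiprocessor.
// ptxas caps register allocation to meet those bounds, spilling to
// local memory if necessary -- the spills show up in the
// "spill stores / spill loads" statistics from -Xptxas="-v".
__global__ void
__launch_bounds__(256, 4)
my_kernel(float *a, float *b, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    d[tidx] = a[tidx] + b[tidx];  // placeholder body
}
```

Comparing the `-Xptxas="-v"` statistics before and after adding the bounds gives a feel for how much of the kernel's register demand is genuine versus recoverable through spilling.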