I have a somewhat complex kernel with the following stats:
ptxas info : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info : Function properties for my_kernel
32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
It's not clear to me which part of the kernel is the "high water mark" in terms of register usage. The nature of the kernel is such that stubbing out various parts for constant values causes the optimizer to constant-fold later parts, etc. (at least that's how it seems, since the numbers I get back when I do so don't make much sense).
The CUDA profiler is similarly unhelpful AFAICT, simply telling me that I have register pressure.
Is there a way to get more information about register usage? I'd prefer a tool of some kind, but I'd also be interested in hearing about digging into the compiled binary directly, if that's what it takes.
Edit: It is certainly possible for me to approach this bottom-up (i.e. making experimental code changes, checking the impact on register usage, etc.), but I'd rather start top-down, or at least get some guidance on where to begin a bottom-up investigation.
You can get a feel for the complexity of the compiler output by compiling to annotated PTX like this:
nvcc -ptx -Xopencc="-LIST:source=on" branching.cu
which will emit a PTX assembly file with the original C code interleaved as comments:
.entry _Z11branchTest0PfS_S_ (
        .param .u64 __cudaparm__Z11branchTest0PfS_S__a,
        .param .u64 __cudaparm__Z11branchTest0PfS_S__b,
        .param .u64 __cudaparm__Z11branchTest0PfS_S__d)
{
    .reg .u16 %rh<4>;
    .reg .u32 %r<5>;
    .reg .u64 %rd<10>;
    .reg .f32 %f<5>;
    .loc    16  1   0
//  1  __global__ void branchTest0(float *a, float *b, float *d)
$LDWbegin__Z11branchTest0PfS_S_:
    .loc    16  7   0
//  3      unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
//  4      float aval = a[tidx], bval = b[tidx];
//  5      float z0 = (aval > bval) ? aval : bval;
//  6
//  7      d[tidx] = z0;
    mov.u16         %rh1, %ctaid.x;
    mov.u16         %rh2, %ntid.x;
    mul.wide.u16    %r1, %rh1, %rh2;
    cvt.u32.u16     %r2, %tid.x;
    add.u32         %r3, %r2, %r1;
    cvt.u64.u32     %rd1, %r3;
    mul.wide.u32    %rd2, %r3, 4;
    ld.param.u64    %rd3, [__cudaparm__Z11branchTest0PfS_S__a];
    add.u64         %rd4, %rd3, %rd2;
    ld.global.f32   %f1, [%rd4+0];
    ld.param.u64    %rd5, [__cudaparm__Z11branchTest0PfS_S__b];
    add.u64         %rd6, %rd5, %rd2;
    ld.global.f32   %f2, [%rd6+0];
    max.f32         %f3, %f1, %f2;
    ld.param.u64    %rd7, [__cudaparm__Z11branchTest0PfS_S__d];
    add.u64         %rd8, %rd7, %rd2;
    st.global.f32   [%rd8+0], %f3;
    .loc    16  8   0
//  8  }
    exit;
$LDWend__Z11branchTest0PfS_S_:
} // _Z11branchTest0PfS_S_
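For reference, the kernel that produced this listing can be read straight out of the interleaved comments. Reconstructed here (the function signature and body are taken verbatim from the PTX comments; only the braces and layout are inferred):

```cuda
// branching.cu -- reconstructed from the source comments embedded in the PTX above
__global__ void branchTest0(float *a, float *b, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
    float aval = a[tidx], bval = b[tidx];
    float z0 = (aval > bval) ? aval : bval;

    d[tidx] = z0;
}
```

Note how the ternary conditional compiles to a single `max.f32` in the PTX (and `FMNMX` in the machine code further down) rather than a branch.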
Note that this doesn't directly tell you anything about the register usage, because PTX uses static single assignment (SSA) form, but it shows you what the assembler is given as input and how it relates to your original code. With the CUDA 4.0 toolkit, you can then compile the C to a cubin file for the Fermi architecture:
$ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
ptxas info : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
ptxas info : Function properties for _Z11branchTest1PfS_S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
and use the cuobjdump utility to disassemble the machine code the assembler produces:
$ cuobjdump -sass branching.cubin
code for sm_20
    Function : _Z11branchTest0PfS_S_
    /*0000*/  /*0x00005de428004404*/  MOV R1, c [0x1] [0x100];
    /*0008*/  /*0x94001c042c000000*/  S2R R0, SR_CTAid_X;
    /*0010*/  /*0x84009c042c000000*/  S2R R2, SR_Tid_X;
    /*0018*/  /*0x10015de218000000*/  MOV32I R5, 0x4;
    /*0020*/  /*0x2000dc0320044000*/  IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
    /*0028*/  /*0x10311c435000c000*/  IMUL.U32.U32.HI R4, R3, 0x4;
    /*0030*/  /*0x80319c03200b8000*/  IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
    /*0038*/  /*0x9041dc4348004000*/  IADD.X R7, R4, c [0x0] [0x24];
    /*0040*/  /*0xa0321c03200b8000*/  IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
    /*0048*/  /*0x00609c8584000000*/  LD.E R2, [R6];
    /*0050*/  /*0xb0425c4348004000*/  IADD.X R9, R4, c [0x0] [0x2c];
    /*0058*/  /*0xc0329c03200b8000*/  IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
    /*0060*/  /*0x00801c8584000000*/  LD.E R0, [R8];
    /*0068*/  /*0xd042dc4348004000*/  IADD.X R11, R4, c [0x0] [0x34];
    /*0070*/  /*0x00201c00081e0000*/  FMNMX R0, R2, R0, !pt;
    /*0078*/  /*0x00a01c8594000000*/  ST.E [R10], R0;
    /*0080*/  /*0x00001de780000000*/  EXIT;
    ......................................
It is usually possible to trace back from the assembler output to the PTX and get at least a rough idea of where the "greedy" code sections are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA documents its ELF format for device code, I reckon a proper code-analysis tool would be a great project for someone.
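While tracking down the offending code, register usage can at least be capped rather than merely observed: the `-maxrregcount` nvcc flag limits registers for all kernels, and the `__launch_bounds__` qualifier does it per kernel, at the cost of extra spills. A minimal sketch (the bound values here are illustrative, not recommendations; the kernel body is a placeholder):

```cuda
// Hint to ptxas: this kernel will be launched with at most 256 threads
// per block, and we want at least 4 blocks resident per multiprocessor.
// ptxas caps register allocation to meet those bounds, spilling to
// local memory if necessary -- the spills show up in the
// "spill stores / spill loads" statistics from -Xptxas="-v".
__global__ void
__launch_bounds__(256, 4)
my_kernel(float *a, float *b, float *d)
{
    unsigned int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    d[tidx] = a[tidx] + b[tidx];  // placeholder body
}
```

Comparing the `-Xptxas="-v"` statistics before and after adding the bounds gives a feel for how much of the kernel's register demand is genuine versus recoverable through spilling.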