I'm running kernel benchmarks with OpenCL. I know that I can compile kernels offline with various tools from OpenCL vendors (e.g. ioc64 or poclcc). The problem is that I get performance results that I cannot explain from the assembly produced by these tools, from OpenCL runtime overhead, or similar factors.
I would like to see the assembly of the online-compiled kernels that are actually compiled and executed by my benchmark program. Is there any way to do that?
My idea was to retrieve this assembly from the cl::Program or cl::Kernel objects, but I haven't found a way to do that. I'd appreciate any advice or solutions.
For Intel Graphics you can use clGetKernelInfo(..., CL_KERNEL_BINARY_PROGRAM_INTEL, ...) to get the kernel ISA bits directly. To disassemble those bits, get the latest GEN ISA disassembler and build it as described here; specifically, see the section on Building an Intel GPU ISA Disassembler. I haven't used it in a while, but the Intel OpenCL SDK used to do an even better job of this (I'm not a GUI person myself), and this is a good article on how to use that tool to scrutinize the assembly.
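Here is a minimal sketch of the query in the C API, assuming an already-built cl_kernel from your benchmark; CL_KERNEL_BINARY_PROGRAM_INTEL comes from Intel's cl_ext headers (the fallback value below is for illustration, so verify it against your header), and the iga64 invocation in the comment is only an example:

```cpp
// Minimal sketch: dump the GEN ISA bits of an already-built kernel so they
// can be disassembled offline.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Normally provided by Intel's cl_ext headers; fallback value here is for
// illustration, verify it against your header.
#ifndef CL_KERNEL_BINARY_PROGRAM_INTEL
#define CL_KERNEL_BINARY_PROGRAM_INTEL 0x407D
#endif

void dumpIntelKernelIsa(cl_kernel kernel, const char* path) {
    // First query the size of the kernel binary, then fetch it.
    size_t size = 0;
    clGetKernelInfo(kernel, CL_KERNEL_BINARY_PROGRAM_INTEL, 0, nullptr, &size);

    std::vector<unsigned char> bits(size);
    clGetKernelInfo(kernel, CL_KERNEL_BINARY_PROGRAM_INTEL, size, bits.data(), nullptr);

    // Write the raw ISA to disk for the GEN ISA disassembler, e.g.:
    //   iga64 -d -p=<your GEN platform> kernel.isa
    std::FILE* f = std::fopen(path, "wb");
    std::fwrite(bits.data(), 1, bits.size(), f);
    std::fclose(f);
}
```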
For NVidia, the "binary" returned by clGetProgramInfo(...CL_PROGRAM_BINARIES...)
actually returns ptx. This might be enough, but if you want the exact shader assembly executed, then you can actually feed the ptx into ptxas
and then disassemble cuobjdump
with the --dump-sass
option to get the lowest level assembly. Note, we're reduced to guessing that the NVidia driver is using the same algorithm as ptxas
, but it seems logical.
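A minimal sketch of that flow, assuming the program was built for a single device; the sm_70 architecture in the comments is a placeholder for your actual GPU:

```cpp
// Minimal sketch: grab the PTX text the NVidia driver returns as the program
// "binary" and write it to a file for offline processing with ptxas/cuobjdump.
// Assumes the program was built for exactly one device.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

void dumpPtx(cl_program program, const char* path) {
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);

    std::vector<unsigned char> ptx(size);
    unsigned char* ptrs[1] = { ptx.data() };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptrs), ptrs, nullptr);

    std::FILE* f = std::fopen(path, "wb");
    std::fwrite(ptx.data(), 1, ptx.size(), f);
    std::fclose(f);

    // Then on the command line (replace sm_70 with your GPU's architecture):
    //   ptxas -arch=sm_70 kernel.ptx -o kernel.cubin
    //   cuobjdump --dump-sass kernel.cubin
}
```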
AMD likely has similar tools, but I am less versed in them.
In the clBuildProgram call you can pass compiler options.
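For example, here is a hedged sketch that passes NVidia's -cl-nv-verbose option (from the cl_nv_compiler_options extension) and prints the build log, which should then include register and local memory usage; substitute whatever options your vendor supports:

```cpp
// Minimal sketch: pass a vendor compiler option to clBuildProgram and print
// the build log. "-cl-nv-verbose" (cl_nv_compiler_options extension) should
// make NVidia's compiler report register and local memory usage in the log;
// other vendors accept their own options here.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

void buildVerbose(cl_program program, cl_device_id device) {
    clBuildProgram(program, 1, &device, "-cl-nv-verbose", nullptr, nullptr);

    size_t logSize = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, nullptr, &logSize);

    std::vector<char> log(logSize);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, log.data(), nullptr);
    std::printf("%s\n", log.data());
}
```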