How to compile PTX code

Tags:

I need to modify the PTX code and compile it directly. The reason is that I want to have some specific instructions right after each other and it is difficult to write a cuda code that results my target PTX code, So I need to modify ptx code directly. The problem is that I can compile it to (fatbin and cubin) but I dont know how to compile those (.fatbin and .cubin) to "X.o" file.

215

asked Nov 15 '13 23:11

user2998135

Video Answer

1 Answers

There may be a way to do this with an orderly sequence of nvcc commands, but I'm not aware of it and haven't discovered it.

One possible approach however, albeit messy, is to interrupt and restart the cuda compilation sequence, and edit the ptx file in the interim (before the restart). This is based on information provided in the nvcc manual, and I would not consider this a standard methodology, so your mileage may vary. There may be any number of scenarios that I haven't considered where this doesn't work or isn't feasible.

In order to explain this I shall present an example code:

#include <stdio.h>

__global__ void mykernel(int *data){

  (*data)++;
}

int main(){

  int *d_data, h_data = 0;
  cudaMalloc((void **)&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  return 0;
}

For this purpose, I am dispensing with cuda error checking and other niceties, in favor of brevity.

Ordinarily we might compile the above code as follows:

nvcc -arch=sm_20 -o t266 t266.cu

(assuming the source file is named t266.cu)

Instead, based on the reference manual, we'll compile as follows:

nvcc -arch=sm_20 -keep -o t266 t266.cu

This will build the executable, but will keep all intermediate files, including t266.ptx (which contains the ptx code for mykernel)

If we simply ran the executable at this point, we'd get output like this:

$ ./t266
data = 1
$

The next step will be to edit the ptx file to make whatever changes we want. In this case, we'll have the kernel add 2 to the data variable instead of adding 1. The relevant line is:

    add.s32         %r2, %r1, 2;
                              ^
                              |
                          change the 1 to a 2 here

Now comes the messy part. The next step is to capture all the intermediate compile commands, so we can rerun some of them:

nvcc -dryrun -arch=sm_20 -o t266 t266.cu --keep 2>dryrun.out

(Using linux redirection of stderr here). We then want to edit that dryrun.out file so that:

we retain all the commands after the creation of the ptx file, up to the end of the file. The line that creates the ptx file will be evident as the one which specifies -o "t266.ptx"
we strip out the leading #$ that each line begins with, so in effect we are creating a script.

When I perform the above 2 steps, I end up with a script like this:

ptxas  -arch=sm_20 -m64  "t266.ptx"  -o "t266.sm_20.cubin"
fatbinary --create="t266.fatbin" -64 --key="xxxxxxxxxx" --ident="t266.cu" "--image=profile=sm_20,file=t266.sm_20.cubin" "--image=profile=compute_20,file=t266.ptx" --embedded-fatbin="t266.fatbin.c" --cuda
gcc -D__CUDA_ARCH__=200 -E -x c++   -DCUDA_DOUBLE_MATH_FUNCTIONS   -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266.cu.cpp.ii" "t266.cudafe1.cpp"
gcc -c -x c++ "-I/usr/local/cuda/bin/..//include"   -fpreprocessed -m64 -o "t266.o" "t266.cu.cpp.ii"
nvlink --arch=sm_20 --register-link-binaries="t266_dlink.reg.c" -m64   "-L/usr/local/cuda/bin/..//lib64" "t266.o"  -o "t266_dlink.sm_20.cubin"
fatbinary --create="t266_dlink.fatbin" -64 --key="t266_dlink" --ident="t266.cu " -link "--image=profile=sm_20,file=t266_dlink.sm_20.cubin" --embedded-fatbin="t266_dlink.fatbin.c"
gcc -c -x c++ -DFATBINFILE="\"t266_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"t266_dlink.reg.c\"" -I. "-I/usr/local/cuda/bin/..//include"   -m64 -o "t266_dlink.o" "/usr/local/cuda/bin/crt/link.stub"
g++ -m64 -o "t266" -Wl,--start-group "t266_dlink.o" "t266.o"   "-L/usr/local/cuda/bin/..//lib64" -lcudart_static  -lrt -lpthread -ldl  -Wl,--end-group

Finally, execute the above script. (in linux you can make this script file executable using chmod +x dryrun.out or similar.) If you haven't made any mistakes while editing the .ptx file, the commands should all complete successfully, and create a new t266 executable file.

When we run that file, we observe:

$ ./t266
data = 2
$

Indicating that our changes were successful.

125

answered Oct 17 '22 07:10

Robert Crovella

Related questions
                            
                                What is a good sorting algorithm on CUDA?
                            
                                Is CUDA hardware needed at compile time?
                            
                                Where are the GPU functions on OpenCV 3.0?
                            
                                What is the best way to learn CUDA? [closed]
                            
                                CUDA streams not overlapping
                            
                                Global variable in CUDA
                            
                                How to output C/C++ annotated PTX in CUDA 4.1/4.2/5.0
                            
                                Asynchronous cuBLAS calls
                            
                                CUDA, mutex and atomicCAS()
                            
                                TensorFlow GPU: is cudnn optional? Couldn't open CUDA library libcudnn.so
                            
                                C++ volatile and operator overloading for CUDA application
                            
                                How to use CUDA constant memory in a programmer pleasant way?
                            
                                How to transpose a matrix in an optimal way using blas?
                            
                                CUDA - why is warp based parallel reduction slower?
                            
                                differences between virtual and real architecture of cuda
                            
                                What are the real C++ language constructs supported by CUDA device code?
                            
                                CUDA - Implementing Device Hash Map?
                            
                                1D Min-convolution in CUDA
                            
                                Where to find CUDA's cutil_math.h?
                            
                                adding a cuda file to an existing c project in visual studio

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to compile PTX code

Tags:

cuda

nvcc

ptx

user2998135

People also ask

Video Answer

1 Answers

Robert Crovella

Recent Activity

Donate For Us