Well, I have quite a delicate question :)
Let's start with what I have:
What I want: Basically, I just want to make this as efficient (fast) as possible, e.g. avoid compiling CUDA to PTX at runtime. The solution can even be completely device-specific; no broad compatibility is required here :)
What I know: I already know the function cuModuleLoad, which can load a module and create a kernel from PTX code stored in a file. But I think there must be some other way to create a kernel directly, without saving it to a file first. Or perhaps it may be possible to store it as bytecode?
My question: How would you do that? Could you post an example or a link to a website covering a similar topic? TY
Edit: OK, a PTX kernel can now be run from a PTX string (char array) directly. Anyway, I still wonder: is there some better / faster solution to this? There is still the conversion from the string to some PTX bytecode, which should be avoided if possible. I also suspect that some clever way of creating a device-specific CUDA binary from PTX might exist, which would remove the JIT compiler lag (it is small, but it can add up if you have huge numbers of kernels to run) :)
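For reference, here is a minimal sketch of that string-based path (ptx_source stands in for real PTX text, e.g. the output of nvcc -ptx, and the entry-point name "kernel" is only illustrative):

#include <cuda.h>
#include <cassert>

// ptx_source must hold real PTX text, e.g. generated by "nvcc -ptx"
// or built at runtime; the empty placeholder below will obviously not load.
const char* ptx_source = "";

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fun;

    assert(cuInit(0) == CUDA_SUCCESS);
    assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
    assert(cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS);

    // The PTX image is taken directly from memory; no file is involved.
    assert(cuModuleLoadData(&mod, ptx_source) == CUDA_SUCCESS);
    assert(cuModuleGetFunction(&fun, mod, "kernel") == CUDA_SUCCESS);
    return 0;
}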
In his comment, Roger Dahl has linked the following post, Passing the PTX program to the CUDA driver directly, in which the use of two functions, namely cuModuleLoad and cuModuleLoadDataEx, is addressed. The former is used to load PTX code from a file and pass it to the CUDA driver, which compiles it for the device. The latter avoids the I/O and enables passing the PTX code to the driver as a C string. In either case, you need to already have the PTX code at your disposal, either as the result of the compilation of a CUDA kernel (to be loaded or copied and pasted into the C string) or as a hand-written source.
But what happens if you have to create the PTX code on-the-fly, starting from a CUDA kernel? Following the approach in CUDA Expression Templates, you can define a string containing your CUDA kernel, like this:
ss << "extern \"C\" __global__ void kernel( ";
ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
ss << "\tfor(unsigned int i = 0; i < ";
ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
ss << "\t\tif(idx < vector_size) { \n";
ss << "\t\t\t" << eval_line.str() << "\n";
ss << "\t\t\tidx += number_of_used_threads;\n";
ss << "\t\t}\n";
ss << "\t}\n";
ss << "}\n\n\n\n";
and then compile it with a system call as
// NVCC and NVCC_FLAGS are assumed to be macros (or strings) holding the
// compiler path and its options; kernel_comp_filename is the output .ptx file.
int nvcc_exit_status = system(
    (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename
     + " -o " + kernel_comp_filename).c_str()
);
if (nvcc_exit_status) {
    std::cerr << "ERROR: nvcc exits with status code: " << nvcc_exit_status << std::endl;
    exit(1);
}
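As for the JIT-lag concern in the question: replacing -ptx with -cubin makes nvcc emit a device-specific binary that the driver loads without any PTX-to-binary JIT step. A sketch, assuming the target architecture is known up front (sm_60 is just an example):

// Compile straight to a device-specific .cubin; it loads with the same
// cuModuleLoad call below, skipping the driver-side JIT entirely.
int status = system(
    (std::string(NVCC) + " -cubin -arch=sm_60 " + kernel_filename
     + " -o " + kernel_comp_filename).c_str()
);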
and finally use cuModuleLoad and cuModuleGetFunction to load the compiled code from the file and pass it to the CUDA driver, like
// Load the compiled module (PTX is JIT-compiled at this point; a cubin is
// loaded as-is) and fetch a handle to the generated "kernel" entry point.
result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
assert(result == CUDA_SUCCESS);
result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
assert(result == CUDA_SUCCESS);
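To actually run the loaded function you pass it to cuLaunchKernel; the sketch below assumes, purely for illustration, that the generated kernel takes a single float* plus the two unsigned int counters from the signature above:

// Hypothetical argument list matching
// kernel(float* data, unsigned int vector_size, unsigned int number_of_used_threads).
CUdeviceptr d_data;                       // allocated earlier with cuMemAlloc
unsigned int vector_size = 1 << 20;
unsigned int number_of_used_threads = 256 * 64;
void* args[] = { &d_data, &vector_size, &number_of_used_threads };

result = cuLaunchKernel(cuFunction,
                        64, 1, 1,         // grid dimensions
                        256, 1, 1,        // block dimensions
                        0, NULL,          // shared memory bytes, stream
                        args, NULL);
assert(result == CUDA_SUCCESS);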
Of course, expression templates have nothing to do with this problem and I'm only quoting the source of the ideas I'm reporting in this answer.