Well, I have quite a delicate question :)
Let's start with what I have:
What I want: Basically, I just want to make this as efficient (fast) as possible, e.g. avoid compiling CUDA to PTX at runtime. The solution can even be completely device-specific; no broad compatibility is required here :)
What I know: I already know the function cuModuleLoad, which can load a module and create a kernel from PTX code stored in a file. But I think there must be some other way to create a kernel directly, without saving it to a file first. Or perhaps it may be possible to store it as bytecode?
My question: How would you do that? Could you post an example or a link to a website covering a similar topic? TY
Edit: OK, a PTX kernel can now be run from a PTX string (char array) directly. Anyway, I still wonder: is there some better / faster solution to this? There is still the conversion from the string to some PTX bytecode, which should be avoided if possible. I also suspect that some clever way of creating a device-specific CUDA binary from PTX might exist, which would remove the JIT compiler lag (it is small, but it can add up if you have huge numbers of kernels to run) :)
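For reference, here is a minimal sketch of that string-based path (ptx_source stands in for real PTX text, e.g. the output of nvcc -ptx, and the entry-point name "kernel" is only illustrative):

#include <cuda.h>
#include <cassert>

// ptx_source must hold real PTX text, e.g. generated by "nvcc -ptx"
// or built at runtime; the empty placeholder below will obviously not load.
const char* ptx_source = "";

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fun;

    assert(cuInit(0) == CUDA_SUCCESS);
    assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
    assert(cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS);

    // The PTX image is taken directly from memory; no file is involved.
    assert(cuModuleLoadData(&mod, ptx_source) == CUDA_SUCCESS);
    assert(cuModuleGetFunction(&fun, mod, "kernel") == CUDA_SUCCESS);
    return 0;
}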
In his comment, Roger Dahl has linked the following post, Passing the PTX program to the CUDA driver directly, in which the use of two functions, namely cuModuleLoad and cuModuleLoadDataEx, is addressed. The former is used to load PTX code from a file and pass it to the CUDA driver, which compiles it for the device. The latter avoids the I/O and enables passing the PTX code to the driver as a C string. In either case, you need to already have the PTX code at your disposal, either as the result of the compilation of a CUDA kernel (to be loaded or copied and pasted into the C string) or as a hand-written source.
But what happens if you have to create the PTX code on-the-fly, starting from a CUDA kernel? Following the approach in CUDA Expression Templates, you can define a string containing your CUDA kernel, like this:
ss << "extern \"C\" __global__ void kernel( ";
ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
ss << "\tfor(unsigned int i = 0; i < ";
ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
ss << "\t\tif(idx < vector_size) { \n";
ss << "\t\t\t" << eval_line.str() << "\n";
ss << "\t\t\tidx += number_of_used_threads;\n";
ss << "\t\t}\n";
ss << "\t}\n";
ss << "}\n\n\n\n";
and then compile it with a system call as
// NVCC and NVCC_FLAGS are assumed to be macros (or strings) holding the
// compiler path and its options; kernel_comp_filename is the output .ptx file.
int nvcc_exit_status = system(
    (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename
     + " -o " + kernel_comp_filename).c_str()
);
if (nvcc_exit_status) {
    std::cerr << "ERROR: nvcc exits with status code: " << nvcc_exit_status << std::endl;
    exit(1);
}
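As for the JIT-lag concern in the question: replacing -ptx with -cubin makes nvcc emit a device-specific binary that the driver loads without any PTX-to-binary JIT step. A sketch, assuming the target architecture is known up front (sm_60 is just an example):

// Compile straight to a device-specific .cubin; it loads with the same
// cuModuleLoad call below, skipping the driver-side JIT entirely.
int status = system(
    (std::string(NVCC) + " -cubin -arch=sm_60 " + kernel_filename
     + " -o " + kernel_comp_filename).c_str()
);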
and finally use cuModuleLoad and cuModuleGetFunction to load the compiled code from the file and pass it to the CUDA driver, like
// Load the compiled module (PTX is JIT-compiled at this point; a cubin is
// loaded as-is) and fetch a handle to the generated "kernel" entry point.
result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
assert(result == CUDA_SUCCESS);
result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
assert(result == CUDA_SUCCESS);
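To actually run the loaded function you pass it to cuLaunchKernel; the sketch below assumes, purely for illustration, that the generated kernel takes a single float* plus the two unsigned int counters from the signature above:

// Hypothetical argument list matching
// kernel(float* data, unsigned int vector_size, unsigned int number_of_used_threads).
CUdeviceptr d_data;                       // allocated earlier with cuMemAlloc
unsigned int vector_size = 1 << 20;
unsigned int number_of_used_threads = 256 * 64;
void* args[] = { &d_data, &vector_size, &number_of_used_threads };

result = cuLaunchKernel(cuFunction,
                        64, 1, 1,         // grid dimensions
                        256, 1, 1,        // block dimensions
                        0, NULL,          // shared memory bytes, stream
                        args, NULL);
assert(result == CUDA_SUCCESS);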
Of course, expression templates have nothing to do with this problem and I'm only quoting the source of the ideas I'm reporting in this answer.