
PyTorch - Where are kernels launched?

Tags:

pytorch

I need to get information about the kernels that PyTorch launches. For example, call-stack information such as "main.py:24 -> ... -> callkernel.py:53" would be beneficial. Is there any way I can gather this information from a PyTorch application's execution? I am also currently searching through the PyTorch source code, but I still have not found the line where a CUDA kernel is actually launched. My questions are twofold:

  • Can I get a call stack at the time of a kernel launch?
  • Can someone show me an example of a kernel launch in the PyTorch source?
asked Aug 31 '25 by Serkan Göktaş


1 Answer

To get a helpful stack trace, you would most likely need to build PyTorch with debug symbols (build instructions are here). I'm not sure whether any debug builds are available to download. A stack trace might not make much sense without some background, though, so here's a general outline of where things are defined in the codebase:

Most operators in PyTorch are implemented as C++ functions in the at::native namespace, under pytorch/aten/src/ATen/native. When PyTorch is built, codegen scripts automatically generate the Python functions and the Python-to-C++ bindings for the operators declared in native_functions.yaml; the generated code is not checked into the repo, so you would have to either read the scripts or build PyTorch yourself if you want to see what the codegen produces.

An at::native operator will usually call a device dispatch function for that operator, which is often suffixed with _stub. The dispatch function checks which device (CPU, CUDA, etc.) the arguments are on and then runs a device-specific implementation. From there, another dispatch happens, which calls a datatype-specific implementation.
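To make that concrete, here is a minimal hand-written sketch of the pattern (not PyTorch's actual DispatchStub machinery, which lives in aten/src/ATen/native/DispatchStub.h and is macro-based): a stub object holds one function pointer per device type, each backend registers its implementation, and calling the stub picks the right one.

    #include <cstdio>
    #include <stdexcept>

    // Simplified sketch of the "_stub" device-dispatch pattern. The real
    // DispatchStub machinery is more general, but the idea is the same.
    enum class DeviceType { CPU, CUDA };

    using AddFn = void (*)(float* out, const float* a, const float* b, long n);

    struct AddStub {
        AddFn cpu_impl  = nullptr;
        AddFn cuda_impl = nullptr;

        // Pick the implementation based on where the tensors live.
        void operator()(DeviceType device, float* out, const float* a,
                        const float* b, long n) const {
            switch (device) {
                case DeviceType::CPU:  cpu_impl(out, a, b, n);  break;
                case DeviceType::CUDA: cuda_impl(out, a, b, n); break;
                default: throw std::runtime_error("no kernel registered");
            }
        }
    };

    // A backend-specific kernel that gets registered into the stub.
    void add_kernel_cpu(float* out, const float* a, const float* b, long n) {
        for (long i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }

    AddStub add_stub;  // roughly what DEFINE_DISPATCH(add_stub) produces

    int main() {
        add_stub.cpu_impl = add_kernel_cpu;  // REGISTER_DISPATCH-style step
        float a[3] = {1, 2, 3}, b[3] = {4, 5, 6}, out[3];
        add_stub(DeviceType::CPU, out, a, b, 3);
        std::printf("%g %g %g\n", out[0], out[1], out[2]);
    }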

To go through an example, the add.out operator (which is called when you do torch.add(..., out=...) in Python) is declared here. Codegen generates everything needed to bind the Python function to at::native::add_out, which is defined here. Notice that this function calls add_stub, which is the device dispatch function.

A CPU implementation for add_stub is registered here and implemented here as add_kernel. A CUDA implementation is registered here and implemented here as add_kernel_cuda. Notice that both of these use a TensorIteratorBase object. Long story short, this object will iterate through each pair of elements in the tensor inputs that should be added together.
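As a rough illustration of that idea (this is hand-written sketch code, not TensorIteratorBase, which additionally handles broadcasting, type promotion, non-contiguous strides, and output allocation), the element-wise loop for a binary op boils down to walking the two inputs in lockstep and applying the operation:

    #include <cstdio>

    // Walk both inputs in lockstep and apply the element-wise operation.
    template <typename T, typename Op>
    void for_each_pair(T* out, const T* a, const T* b, long n, Op op) {
        for (long i = 0; i < n; ++i) {
            out[i] = op(a[i], b[i]);
        }
    }

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
        float alpha = 1.0f;
        // The loop body supplied by an add kernel is essentially a + alpha * b.
        for_each_pair(out, a, b, 4,
                      [alpha](float x, float y) { return x + alpha * y; });
        for (float v : out) std::printf("%g ", v);
        std::printf("\n");
    }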

There is another dispatch within add_kernel and add_kernel_cuda, which chooses a separate implementation based on the data type of the arguments. The per-datatype implementations are generated from a shared template function. You can see that the CPU function also has separate implementations for the vectorized and non-vectorized cases, while the CUDA implementation has just the one here.
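Conceptually, that datatype dispatch looks like the following simplified sketch. PyTorch expresses it with its AT_DISPATCH_* macros rather than a hand-written switch, but the effect is one instantiation of a shared template per supported type, selected at runtime.

    #include <cstdio>
    #include <cstdint>
    #include <stdexcept>

    // One shared template, instantiated per data type, chosen by a runtime switch.
    enum class ScalarType { Float, Double, Int };

    template <typename T>
    void add_impl(T* out, const T* a, const T* b, long n) {
        for (long i = 0; i < n; ++i) out[i] = a[i] + b[i];
    }

    void add_kernel(ScalarType dtype, void* out, const void* a, const void* b, long n) {
        switch (dtype) {
            case ScalarType::Float:
                add_impl(static_cast<float*>(out), static_cast<const float*>(a),
                         static_cast<const float*>(b), n);
                break;
            case ScalarType::Double:
                add_impl(static_cast<double*>(out), static_cast<const double*>(a),
                         static_cast<const double*>(b), n);
                break;
            case ScalarType::Int:
                add_impl(static_cast<int32_t*>(out), static_cast<const int32_t*>(a),
                         static_cast<const int32_t*>(b), n);
                break;
            default:
                throw std::runtime_error("unsupported dtype");
        }
    }

    int main() {
        float a[2] = {1.5f, 2.5f}, b[2] = {1.0f, 1.0f}, out[2];
        add_kernel(ScalarType::Float, out, a, b, 2);
        std::printf("%g %g\n", out[0], out[1]);
    }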

If you want to see a full stack trace, you could run a script with gdb --args python <script name> and set a breakpoint for the specific kernel you want. Again, debug symbols are needed to make sense of it.
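For example, a session might look roughly like this; my_script.py is a placeholder, and breaking on the CUDA runtime's cudaLaunchKernel is one generic way to stop at every kernel launch without knowing PyTorch's internal symbol names:

    $ gdb --args python my_script.py
    (gdb) break cudaLaunchKernel   # gdb may offer to make this pending until libcudart loads; answer y
    (gdb) run
    ...                            # execution stops at the next kernel launch
    (gdb) backtrace                # C++ call stack leading to the launch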

answered Sep 04 '25 by Kurt Mohler