 

Understanding OpKernel::Compute and how TensorFlow sets up GPU execution


I'm having problems understanding the internals of TF (or rather, my progress has slowed down). For the last three days I've been digging through the code (from the "top", going down). I've grasped graph creation and most of what happens before OpKernel::Compute gets called. A quick summary, with a minimal client-side sketch after the lists (please correct me if I've got something important wrong):

  1. Define a graph somewhere
  2. Call DirectSession::Run(). Internally:
  3. The graph gets processed (split/optimized/etc.)
  4. Executors get created for each subgraph
  5. A RunState gets created
  6. Inputs get sent to (stored in) the Rendezvous
  7. Executor::RunAsync() is called for all executors

There:

  1. FillContextMap assigns contexts to all nodes
  2. The Process(TaggedNode, ...) function gets scheduled on the ThreadPool for all root nodes

There:

  1. The input tensors and parameters get prepared (and an OpKernel and OpKernelContext get created)
  2. Device::Compute(kernel, context, ...) gets called (sync or async)
  3. Device::Compute really just calls OpKernel::Compute, from what I've seen
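
A minimal client-side sketch of the entry point described above, assuming the TF 1.x C++ API (the graph contents and the "output:0" tensor name are placeholders):

```cpp
#include <memory>
#include <vector>

#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/public/session.h"

int main() {
  tensorflow::GraphDef graph_def;
  // ... populate graph_def (e.g. ReadBinaryProto() from a frozen graph).

  tensorflow::SessionOptions options;  // no remote target => a DirectSession
  std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));
  TF_CHECK_OK(session->Create(graph_def));

  // Session::Run() is where steps 3-7 above happen: the graph gets
  // partitioned, executors get built, inputs are fed to the Rendezvous,
  // and Executor::RunAsync() is kicked off.
  std::vector<tensorflow::Tensor> outputs;
  TF_CHECK_OK(session->Run({/* inputs */}, {"output:0"}, {}, &outputs));
  return 0;
}
```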

So the problem is partly that I don't know how to work through the rest of the code efficiently, and mostly that I can't grasp how OpKernels schedule themselves on the GPU, and when. Prior to examining TF I read about GPGPU (the OpenCL docs) and I think I get the gist of it: work-items, work-groups, command queues, synchronization, memory management, and the types of memory on the physical device. But I can't seem to map this knowledge to how TF uses the GPU. As I said, I reached OpKernel::Compute and there (in several types of kernels) I see only memory allocation; in some cases (matmul) cuBLAS gets called (so I can't see what's happening), or there is just no GPU implementation. I expected to see some mechanism that assigns a variable number of threads (work-items) to a kernel based on analysis of the graph, some synchronization points being set up, and so on. I'd be very thankful for any clarification on the above topics.
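
For reference, this is roughly the mechanism I expected to find, sketched in the style of TF ~1.x CUDA kernels. MyReluOp, ReluGpuKernel, and the "MyRelu" op name are made up here; GetCudaLaunchConfig and CUDA_1D_KERNEL_LOOP are helpers from tensorflow/core/util/cuda_kernel_helper.h. If this is how it works, the work-item count would be derived from the tensor size at each launch, not from graph analysis:

```cpp
// relu_gpu_sketch.cu.cc -- would be compiled with nvcc as part of a TF build.
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"

namespace tensorflow {

// One CUDA thread per tensor element, via a grid-stride loop.
__global__ void ReluGpuKernel(int n, const float* in, float* out) {
  CUDA_1D_KERNEL_LOOP(i, n) {
    out[i] = in[i] > 0.f ? in[i] : 0.f;
  }
}

class MyReluOp : public OpKernel {
 public:
  explicit MyReluOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const Tensor& input = ctx->input(0);
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));

    const Eigen::GpuDevice& d = ctx->eigen_device<Eigen::GpuDevice>();
    const int n = static_cast<int>(input.NumElements());

    // The "how many work-items" decision happens here, per call,
    // from the tensor size -- not from any graph-level analysis.
    CudaLaunchConfig cfg = GetCudaLaunchConfig(n, d);

    // This only *enqueues* the kernel on this node's stream
    // (d.stream()); Compute() returns without waiting for the GPU.
    ReluGpuKernel<<<cfg.block_count, cfg.thread_per_block, 0, d.stream()>>>(
        n, input.flat<float>().data(), output->flat<float>().data());
  }
};

// Assumes a matching REGISTER_OP("MyRelu") exists elsewhere.
REGISTER_KERNEL_BUILDER(Name("MyRelu").Device(DEVICE_GPU), MyReluOp);

}  // namespace tensorflow
```

If that's right, the grid/block choice would play the role of the OpenCL NDRange, and the stream would play the role of the command queue.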

asked Jan 26 '17 by petko10



1 Answer

One more day on it and I think I got it:

  1. Device::Compute, when the device is a GPU device, has a different implementation: it checks whether the inputs were produced in a context different from the one for this node, and waits for their completion via an event hook if they were
  2. Then the current stream gets changed (global state for CUDA, I think), and
  3. The OpKernel::Compute method gets called, where any kernel to be executed is effectively queued on the stream (the equivalent of a command queue in OpenCL); a standalone CUDA sketch of this pattern follows the list
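
To make the OpenCL-to-CUDA mapping concrete, here is a standalone CUDA sketch of that pattern (not TF code; the names are mine): the stream plays the role of the command queue, a cross-stream input dependency is expressed with an event rather than a CPU wait, and the host thread never blocks on the kernels themselves.

```cpp
#include <cuda_runtime.h>

__global__ void Producer(float* buf) { buf[threadIdx.x] = 1.f; }
__global__ void Consumer(float* buf) { buf[threadIdx.x] += 1.f; }

int main() {
  float* buf;
  cudaMalloc(&buf, 32 * sizeof(float));

  cudaStream_t producer_stream, consumer_stream;
  cudaStreamCreate(&producer_stream);
  cudaStreamCreate(&consumer_stream);

  cudaEvent_t input_ready;
  cudaEventCreateWithFlags(&input_ready, cudaEventDisableTiming);

  // Enqueue the producer and record an event after it (step 1's "event hook").
  Producer<<<1, 32, 0, producer_stream>>>(buf);
  cudaEventRecord(input_ready, producer_stream);

  // Make the consumer's stream wait on the event: the GPU orders the
  // work itself; the host thread does not block here.
  cudaStreamWaitEvent(consumer_stream, input_ready, 0);
  Consumer<<<1, 32, 0, consumer_stream>>>(buf);  // step 3: merely queued

  cudaStreamSynchronize(consumer_stream);  // host waits only at the very end
  cudaEventDestroy(input_ready);
  cudaStreamDestroy(producer_stream);
  cudaStreamDestroy(consumer_stream);
  cudaFree(buf);
  return 0;
}
```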
answered Sep 25 '22 by petko10