I'm having problems understanding the internals of TF (or rather, my progress has slowed down). For the last three days I've been digging through the code (from the "top", going down). I've grasped graph creation and most of what happens before OpKernel::Compute gets called. Just a quick summary (please correct me if I've got something important wrong):
There:
So the problem is partly that I don't know how to go about the rest of the code (efficiently), and mostly that I can't grasp how OpKernels schedule instances of themselves on the GPU, and when. Prior to examining TF I read about GPGPU (the OpenCL docs) and I think I get the gist of it: work-items, work-groups, command queues, synchronization, memory management, and the types of memory on the physical device. But I can't seem to map this knowledge to how TF uses the GPU. As I said, I've reached OpKernel::Compute, and there (in several types of kernels) I see only memory allocation; in some cases (matmul) cuBLAS gets called (so I can't see what's happening), or there is just no GPU implementation. I expected to see some mechanism that assigns a variable number of threads (work-items) to a kernel based on analysis of the graph, synchronization points being set up, and so on. I'd be very thankful for any clarification on the above topics.
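For what it's worth, here is a minimal standalone CUDA sketch of the pattern element-wise GPU kernels typically follow. This is not TensorFlow's actual code (TF enqueues work through StreamExecutor), and `AddOneKernel` and every other name here are made up for illustration. The point it shows: `Compute()`-level code does no graph-based thread scheduling at all; it just derives a launch configuration from the element count and enqueues the kernel on a stream.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical element-wise kernel. Each thread walks a strided range of
// elements (a "grid-stride loop"), so one launch configuration covers
// any input size.
__global__ void AddOneKernel(const float* in, float* out, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    out[i] = in[i] + 1.0f;
  }
}

int main() {
  const int n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  cudaMemset(in, 0, n * sizeof(float));

  cudaStream_t stream;  // roughly an in-order OpenCL command queue
  cudaStreamCreate(&stream);

  // The launch configuration is derived only from the element count --
  // no graph-level analysis decides the thread (work-item) count.
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  AddOneKernel<<<blocks, threads, 0, stream>>>(in, out, n);

  cudaStreamSynchronize(stream);  // the host blocks only when it must
  printf("done\n");

  cudaStreamDestroy(stream);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

So the "variable number of threads" you expected is chosen per launch from the tensor's size, not assigned by the graph machinery.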
OpKernel::Compute gets called, and if there is a GPU kernel to execute, it is effectively enqueued on the stream (the CUDA equivalent of an OpenCL command queue).
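To make the stream/command-queue analogy concrete, here is another small standalone sketch (again not TF code; `OpA` and `OpB` are hypothetical stand-ins for two consecutive ops). Launches return immediately on the host, and the GPU executes them in FIFO order on the stream, so ops on the same stream need no explicit synchronization between them.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for two consecutive ops in a graph.
__global__ void OpA(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = static_cast<float>(i);
}
__global__ void OpB(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] *= 2.0f;
}

int main() {
  const int n = 1024;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Both launches return immediately on the host; the GPU runs them in
  // FIFO order on this stream, so OpB sees OpA's output without any
  // explicit synchronization point between them.
  OpA<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
  OpB<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

  // The copy is enqueued on the same stream, ordered after OpB.
  float host[n];
  cudaMemcpyAsync(host, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

  // Only here does the host actually wait for everything enqueued so far.
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(d);
  return 0;
}
```

This per-stream in-order behavior is why you don't see explicit synchronization points inside individual kernels; in CUDA terms, ordering across different streams is handled separately, with events.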
One more day on it and I think I got it: