I'm having problems understanding the internals of TF (or rather, my progress has slowed down). For the last three days I've been digging through the code (from the "top", going down). I've grasped graph creation and most of what happens before OpKernel::Compute gets called. Just a quick summary (please correct me if I've got something important wrong):
There:
So the problem is partly that I don't know how to go about the rest of the code (efficiently), and mostly that I can't grasp how OpKernels schedule instances of themselves on the GPU, and when. Prior to examining TF I read about GPGPU (the OpenCL docs) and I think I get the gist of it: work-items, work-groups, command queues, synchronization, memory management, and the types of memory on the physical device. But I can't seem to map this knowledge to how TF uses the GPU. As I said, I've reached OpKernel::Compute, and there (in several types of kernels) I see only memory allocation; in some cases (matmul) cuBLAS gets called (so I can't see what's happening), or there is just no GPU implementation. I expected to see some mechanism that assigns a variable number of threads (work-items) to a kernel based on analysis of the graph, synchronization points being set up, and so on. I'd be very thankful for any clarification on the above topics.
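For what it's worth, here is a minimal standalone CUDA sketch of the pattern element-wise GPU kernels typically follow. This is not TensorFlow's actual code (TF enqueues work through StreamExecutor), and `AddOneKernel` and every other name here are made up for illustration. The point it shows: `Compute()`-level code does no graph-based thread scheduling at all; it just derives a launch configuration from the element count and enqueues the kernel on a stream.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical element-wise kernel. Each thread walks a strided range of
// elements (a "grid-stride loop"), so one launch configuration covers
// any input size.
__global__ void AddOneKernel(const float* in, float* out, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    out[i] = in[i] + 1.0f;
  }
}

int main() {
  const int n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  cudaMalloc(&in, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  cudaMemset(in, 0, n * sizeof(float));

  cudaStream_t stream;  // roughly an in-order OpenCL command queue
  cudaStreamCreate(&stream);

  // The launch configuration is derived only from the element count --
  // no graph-level analysis decides the thread (work-item) count.
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  AddOneKernel<<<blocks, threads, 0, stream>>>(in, out, n);

  cudaStreamSynchronize(stream);  // the host blocks only when it must
  printf("done\n");

  cudaStreamDestroy(stream);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

So the "variable number of threads" you expected is chosen per launch from the tensor's size, not assigned by the graph machinery.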
OpKernel::Compute gets called, and if there is a GPU kernel to execute, it is effectively enqueued on the stream (the CUDA equivalent of an OpenCL command queue).
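To make the stream/command-queue analogy concrete, here is another small standalone sketch (again not TF code; `OpA` and `OpB` are hypothetical stand-ins for two consecutive ops). Launches return immediately on the host, and the GPU executes them in FIFO order on the stream, so ops on the same stream need no explicit synchronization between them.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for two consecutive ops in a graph.
__global__ void OpA(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = static_cast<float>(i);
}
__global__ void OpB(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] *= 2.0f;
}

int main() {
  const int n = 1024;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Both launches return immediately on the host; the GPU runs them in
  // FIFO order on this stream, so OpB sees OpA's output without any
  // explicit synchronization point between them.
  OpA<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
  OpB<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

  // The copy is enqueued on the same stream, ordered after OpB.
  float host[n];
  cudaMemcpyAsync(host, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

  // Only here does the host actually wait for everything enqueued so far.
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(d);
  return 0;
}
```

This per-stream in-order behavior is why you don't see explicit synchronization points inside individual kernels; in CUDA terms, ordering across different streams is handled separately, with events.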
One more day on it and I think I got it: