
Maximize TensorFlow multi-GPU performance

I was wondering if anybody could advise on how to get peak performance out of TensorFlow in a 4-GPU setting.

As a test I created two copies of the same network (an 18-ish-layer residual network with small filter banks (ranging from 16 to 128 filters) on 32x32 inputs; batch size 512, 128 per GPU): one in MXNet and one modelled on the TensorFlow Inception example.
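For context, the blocks in the network look roughly like this (an illustrative sketch only, not my exact code; batch norm is omitted for brevity):

    import tensorflow as tf

    def residual_block(x, filters, name):
        # Two 3x3 convolutions with an identity shortcut; assumes the
        # input channel count already equals `filters` (NHWC layout).
        in_ch = x.get_shape().as_list()[-1]
        with tf.variable_scope(name):
            w1 = tf.get_variable('w1', [3, 3, in_ch, filters],
                                 initializer=tf.truncated_normal_initializer(stddev=0.1))
            w2 = tf.get_variable('w2', [3, 3, filters, filters],
                                 initializer=tf.truncated_normal_initializer(stddev=0.1))
            h = tf.nn.relu(tf.nn.conv2d(x, w1, [1, 1, 1, 1], 'SAME'))
            h = tf.nn.conv2d(h, w2, [1, 1, 1, 1], 'SAME')
            return tf.nn.relu(h + x)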

My MXNet network can train at around 7k examples per second, whereas TensorFlow manages only 4.2k with dummy data and 3.7k with real data.

(When running on 1 GPU the numbers are 1.2k examples per second for MXNet vs 2.1k for TensorFlow.)

Based on my experiments, I have a few questions in hopes of speeding things up.

  1. GPU utilization seems quite low during training. I noticed that the TensorFlow white paper mentions support for running multiple streams on the same GPU. Is this possible in the public release?

  2. Is there any way to perform multiple train operations in one execution of session.run(), or to have async execution? This would allow weight updates to be done at the same time as the next batch's forward pass. I have tried using 2 threads (both plain threads and QueueRunners), but this only resulted in a slowdown. MXNet is able to increase speed by running weight updates on the CPU so that the GPUs can be used for the next batch. (A sketch of my current multi-GPU setup follows this list.)

  3. Will the new distributed runtime get around some of these issues by letting me run more than one worker on a single machine?

  4. Is there something else that can be done?
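For reference, my multi-GPU graph follows the in-graph "tower" pattern from the Inception example, roughly like this (a simplified sketch using TF 1.x-style names, not my exact code; `build_model` stands in for the residual network and `images`/`labels` for the per-GPU input splits):

    import tensorflow as tf

    NUM_GPUS = 4

    opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
    tower_grads = []
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            logits = build_model(images[i])            # stand-in for the resnet
            loss = tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(
                    logits=logits, labels=labels[i]))
            tf.get_variable_scope().reuse_variables()  # share weights across towers
            tower_grads.append(opt.compute_gradients(loss))

    # average the per-tower gradients on the CPU and apply them once
    with tf.device('/cpu:0'):
        avg_grads = []
        for grads_and_vars in zip(*tower_grads):
            grads = tf.stack([g for g, _ in grads_and_vars])
            avg_grads.append((tf.reduce_mean(grads, 0), grads_and_vars[0][1]))
        train_op = opt.apply_gradients(avg_grads)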

I know there are a number of similar questions here on Stack Overflow, but through my searching I couldn't find a solution to my problem that I have not already tried.

Edit:

I did a little bit of CUDA profiling to see what the expensive kernels were. According to my run, 21.4% of the time is spent inside:

    void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
    <Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
    Eigen::TensorPaddingOp<Eigen::array<std::pair<int, int>,
    unsigned long=4> const, Eigen::TensorMap<Eigen::Tensor<float const,
    int=4, int=1, long>, int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)

and 20.0% of the time was spent in

    void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
    <Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
    Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=4>
    const, Eigen::TensorMap<Eigen::Tensor<float const, int=4, int=1, long>,
    int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)

From the signatures alone I am not exactly sure what these kernels are doing. Do these make sense?

In addition to this, the analysis reports low kernel concurrency (0%, as expected) and low compute utilization (34.9%; granted, this includes start-up time and a little bit of Python in the train loop, around 32 seconds out of 91 total, which comes out to around 50% utilization inside TensorFlow).
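(For anyone who wants to reproduce this: besides the CUDA profiler, TensorFlow's own step tracing can attribute time to individual ops rather than raw kernels. A minimal sketch, assuming a live `sess` and `train_op`:)

    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    # trace a single training step
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

    # dump a Chrome trace, viewable at chrome://tracing
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())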

Edit 2:

I have attached a copy of the trimmed-down source code. In general, though, I am more concerned about questions 1-3 and don't want to take too much of everybody's time.

In addition, I am running TensorFlow built from commit f07234db2f7b316b08f7df25417245274b63342a.

Edit 3:

Updated to the most recent TensorFlow (63409bd23facad471973b110df998782c0e19c06), same code, default data format (NHWC), and that seemed to speed things up a lot. On fake data: 6.7k-6.8k examples per second on 4 GPUs (the variation is thermal, I think?); 2.0k examples per second on 1 GPU. On real data: around 4.9k examples per second on 4 GPUs; 1.7k on 1 GPU.

Edit 4:

In addition I tried switching the data format to NCHW. I modelled the conversion on Soumith's benchmarks. The convolution parts were indeed faster, but batch norm appears to be messing everything up. With a naive implementation (fixing the axes, and making the weights [1,C,1,1] instead of [C]), I am only able to get 1.2k examples per second on 4 GPUs (fake data), whereas with a transpose before and after the batch-norm op I am able to get 6.2k examples per second (fake data). Still slower than the NHWC data_format.
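Roughly, the transpose workaround looks like this (an illustrative sketch, not my exact code; `scale` and `offset` are the usual per-channel [C] batch-norm parameters):

    import tensorflow as tf

    def batch_norm_nchw_via_transpose(x, scale, offset, eps=1e-3):
        # x is NCHW; round-trip through NHWC so the existing batch-norm path applies
        x = tf.transpose(x, [0, 2, 3, 1])             # NCHW -> NHWC
        mean, var = tf.nn.moments(x, axes=[0, 1, 2])  # per-channel statistics
        x = tf.nn.batch_normalization(x, mean, var, offset, scale, eps)
        return tf.transpose(x, [0, 3, 1, 2])          # NHWC -> NCHW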

asked Mar 16 '16 by luke




1 Answer

It's a bit hard to diagnose your program's performance problem without seeing the code. Is it possible for us to read your test code somehow?

TensorPadding showing up at the top of the profile is a bit strange; I'd expect cudnn calls to be at the top. Anyway, showing us the test code would be helpful.

answered Oct 17 '22 by zfc