
Why is the prediction rate so low (25-40 [sec] per image) using Faster RCNN for custom object detection on a GPU?

I have trained a faster_rcnn_inception_resnet_v2_atrous_coco model (available here) for custom object detection.

For prediction, I used the object detection demo Jupyter notebook on my images. I also checked the time consumed by each step and found that sess.run was taking almost all of the time.

But it takes around 25-40 [sec] to predict a single image of 3000 x 2000 pixels (around 1-2 [MB]) on the GPU.

Can anyone figure out the problem here?

I have performed profiling; link to download the profiling file.

Link to full profiling
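
For reference, a profile like the one linked above can be produced with TensorFlow's built-in tracing. Below is a minimal sketch using the TF 1.x RunOptions / RunMetadata / timeline API; sess, tensor_dict and feed_dict stand for the notebook's own session, fetches and feed, they are not defined here.

import tensorflow as tf
from tensorflow.python.client import timeline

options      = tf.RunOptions( trace_level = tf.RunOptions.FULL_TRACE )
run_metadata = tf.RunMetadata()

output = sess.run( tensor_dict,                  # the notebook's fetches
                   feed_dict    = feed_dict,     # the notebook's feed
                   options      = options,
                   run_metadata = run_metadata
                   )

tl = timeline.Timeline( run_metadata.step_stats )
with open( 'timeline.json', 'w' ) as f:          # open in chrome://tracing
    f.write( tl.generate_chrome_trace_format() )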

System information:
Training and prediction run on a virtual machine created in the Azure portal with size Standard_NV6 (details here), which uses an NVIDIA Tesla M60 GPU.

  • OS Platform and Distribution - Windows 10
  • TensorFlow installed from - pip ( pip3 install --upgrade tensorflow-gpu )
  • TensorFlow version - 1.8.0
  • Python version - 3.6.5
  • GPU/CPU - GPU
  • CUDA/cuDNN version - CUDA 9/cuDNN 7
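
A quick way to cross-check the versions listed above from Python (a small sketch using the standard TF 1.x test helpers):

import tensorflow as tf

print( 'TensorFlow version :', tf.__version__ )
print( 'Built with CUDA    :', tf.test.is_built_with_cuda() )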
Asked Jun 04 '18 by Sachin Patel


3 Answers

Can anyone figure out the problem here?

Sorry for being brutally open and straightforward about where the root cause of the observed performance problem lies:

One could hardly find a worse VM setup in the whole Azure portfolio for such a compute-intensive (performance- and throughput-motivated) task. There is simply no "less" equipped option on the menu.

The Azure NV6 is explicitly marketed for the benefit of Virtual Desktop users, where the NVIDIA GRID(R) driver delivers a software layer of services for "sharing" parts of an also-virtualised FrameBuffer for image/video (desktop graphics pixels, single-precision encoders/decoders at most) among teams of users, irrespective of their terminal device (yet at most 15 users per either of the two on-board GPUs, for which it was specifically and explicitly advertised and promoted on Azure as its Key Selling Point; NVIDIA goes even a step further, promoting this device explicitly for, cit., Office Users).

The M60 lacks (obviously, having been designed for a very different market segment) any smart AI / ML / DL / Tensor-processing features, having ~ 20x lower DP performance than GPU devices specialised for AI / ML / DL / Tensor-processing computing.


If I may cite,

... "GRID" is the software component that lays over a given set of Tesla ( Currently M10, M6, M60 ) (and previously Quadro (K1 / K2)) GPUs. In its most basic form (if you can call it that), the GRID software is currently for creating FrameBuffer profiles when using the GPUs in "Graphics" mode, which allows users to share a portion of the GPUs FrameBuffer whilst accessing the same physical GPU.

and

No, the M10, M6 and M60 are not specifically suited for AI. However, they will work, just not as efficiently as other GPUs. NVIDIA creates specific GPUs for specific workloads and industry (technological) areas of use, as each area has different requirements. (credits go to BJones)

Next,
if you are indeed willing to spend effort on this a-priori-known worst option à la carte:

make sure that both GPUs are in "Compute" mode, NOT "Graphics", if you're playing with AI. You can do that using the Linux Boot Utility you'll get with the correct M60 driver package after you've registered for the evaluation. (credits go again to BJones)

which obviously does not seem to be an option for non-Linux / Azure-operated virtualised-access devices.


Résumé:

If striving for increased performance and throughput, best choose another, AI / ML / DL / Tensor-processing-equipped GPU device, one into which the problem-specific computing hardware was actually put and on which there are no software layers (no GRID, or at least one with an easily available disable option) that would in any sense block achieving such advanced levels of GPU-processing performance.

Answered by user3666197


As the website says, the image size should be 600x600 and the code ran on an NVIDIA GeForce GTX TITAN X card. But first, please make sure your code is actually running on the GPU and not on the CPU. I suggest running your code and opening another window to watch the GPU utilisation with the command below, to see if anything changes.

watch nvidia-smi
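
To confirm the same thing from inside Python, one can also list the devices TensorFlow sees and enable device-placement logging. A minimal sketch using the TF 1.x API, where detection_graph stands for the graph object loaded in the demo notebook:

import tensorflow as tf
from tensorflow.python.client import device_lib

# a GPU should appear in this list as '/device:GPU:0'
print( device_lib.list_local_devices() )

# optionally, log where every op gets placed when the session runs
config = tf.ConfigProto( log_device_placement = True )
sess   = tf.Session( graph = detection_graph, config = config )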
Answered by macharya


TensorFlow takes a long time for the initial setup. (Don't worry, it is just a one-time process.)

Loading the graph is a heavy process. I executed this code on my CPU; it took almost 40 seconds to complete the program.

The time taken for the initial setup, i.e. loading the graph, was 37 seconds.

The actual time taken for performing object detection was 3 seconds, i.e. 1.5 seconds per image.

If I had given it 100 images, the total time taken would be 37 + 1.5 * 100 seconds, since the graph does not have to be loaded 100 times.

So in your case, if that took 25 [s], then the initial setup would have taken ~ 23-24 [s] and the actual inference should be ~ 1-2 [s].

You can verify it in the code. You may use the time module in Python:

import time                          # used to obtain time stamps

for image_path in TEST_IMAGE_PATHS:  # iteration of images for detection
    # ------------------------------ # processing of one image begins here
    start = time.time()              # saving current timestamp
    ...
    ...
    ...
    plt.imshow( image_np )
    # ------------------------------ # processing of one image ends here
    print( 'Time taken',
           time.time() - start       # time this image has taken
           )
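
If you also want to see the one-time graph-loading cost separately from the per-image sess.run cost, the same idea can be extended as below. This is only a sketch: PATH_TO_FROZEN_GRAPH and TEST_IMAGE_PATHS follow the demo notebook's conventions, and the output tensor names are the standard ones exported by the Object Detection API.

import time

import numpy as np
import tensorflow as tf
from PIL import Image

t0 = time.time()
detection_graph = tf.Graph()
with detection_graph.as_default():                    # one-time graph loading
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile( PATH_TO_FROZEN_GRAPH, 'rb' ) as fid:
        od_graph_def.ParseFromString( fid.read() )
        tf.import_graph_def( od_graph_def, name = '' )
print( 'Graph load time', time.time() - t0 )

with detection_graph.as_default(), tf.Session( graph = detection_graph ) as sess:
    image_tensor = detection_graph.get_tensor_by_name( 'image_tensor:0' )
    fetches      = [ detection_graph.get_tensor_by_name( name + ':0' )
                     for name in ( 'detection_boxes',   'detection_scores',
                                   'detection_classes', 'num_detections' ) ]

    for image_path in TEST_IMAGE_PATHS:
        image_np = np.array( Image.open( image_path ) )
        t1 = time.time()                              # per-image timing only
        sess.run( fetches,
                  feed_dict = { image_tensor: np.expand_dims( image_np, 0 ) } )
        print( image_path, 'sess.run time', time.time() - t1 )
        # NOTE: the very first sess.run is slower than the rest, as it also
        #       includes one-time GPU / cuDNN initialisation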
Answered by Sreeragh A R