
Why is the prediction rate so low (25-40 [sec] per image) using Faster RCNN for custom object detection on a GPU?

I have trained a faster_rcnn_inception_resnet_v2_atrous_coco model (available here) for custom object detection.

For prediction, I used the object detection demo Jupyter notebook on my images. I also checked the time consumed by each step and found that sess.run was taking almost all of the time.

But it takes around 25-40 [sec] to predict a single image of 3000 x 2000 pixels (around 1-2 [MB]) on the GPU.

Can anyone figure out the problem here?

I have performed profiling; link to download the profiling file.

Link to full profiling
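
For reference, a profile like the one linked above can be produced with TensorFlow's built-in tracing. Below is a minimal sketch using the TF 1.x RunOptions / RunMetadata / timeline API; sess, tensor_dict and feed_dict stand for the notebook's own session, fetches and feed, they are not defined here.

import tensorflow as tf
from tensorflow.python.client import timeline

options      = tf.RunOptions( trace_level = tf.RunOptions.FULL_TRACE )
run_metadata = tf.RunMetadata()

output = sess.run( tensor_dict,                  # the notebook's fetches
                   feed_dict    = feed_dict,     # the notebook's feed
                   options      = options,
                   run_metadata = run_metadata
                   )

tl = timeline.Timeline( run_metadata.step_stats )
with open( 'timeline.json', 'w' ) as f:          # open in chrome://tracing
    f.write( tl.generate_chrome_trace_format() )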

System information:
Training and prediction run on a virtual machine created in the Azure portal with size Standard_NV6 (details here), which uses an NVIDIA Tesla M60 GPU.

  • OS Platform and Distribution - Windows 10
  • TensorFlow installed from - pip ( pip3 install --upgrade tensorflow-gpu )
  • TensorFlow version - 1.8.0
  • Python version - 3.6.5
  • GPU/CPU - GPU
  • CUDA/cuDNN version - CUDA 9/cuDNN 7
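
A quick way to cross-check the versions listed above from Python (a small sketch using the standard TF 1.x test helpers):

import tensorflow as tf

print( 'TensorFlow version :', tf.__version__ )
print( 'Built with CUDA    :', tf.test.is_built_with_cuda() )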
Asked Jun 04 '18 by Sachin Patel


3 Answers

Can anyone figure out the problem here?

Sorry for being brutally open and straightforward about where the root cause of the observed performance problem lies:

One could hardly find a worse VM setup in the whole Azure portfolio for such a compute-intensive (performance- and throughput-motivated) task. There is simply no "less" equipped option on the menu.

The Azure NV6 is explicitly marketed for the benefit of Virtual Desktop users, where the NVIDIA GRID(R) driver delivers a software layer of services for "sharing" parts of an also-virtualised FrameBuffer for image/video (desktop graphics pixels, single-precision encoders/decoders at most) among teams of users, irrespective of their terminal device (yet at most 15 users per either of the two on-board GPUs, for which it was specifically and explicitly advertised and promoted on Azure as its Key Selling Point; NVIDIA goes even a step further, promoting this device explicitly for, cit., Office Users).

The M60 lacks (obviously, having been designed for a very different market segment) any smart AI / ML / DL / Tensor-processing features, having ~ 20x lower DP performance than GPU devices specialised for AI / ML / DL / Tensor-processing computing.


If I may cite,

... "GRID" is the software component that lays over a given set of Tesla ( Currently M10, M6, M60 ) (and previously Quadro (K1 / K2)) GPUs. In its most basic form (if you can call it that), the GRID software is currently for creating FrameBuffer profiles when using the GPUs in "Graphics" mode, which allows users to share a portion of the GPUs FrameBuffer whilst accessing the same physical GPU.

and

No, the M10, M6 and M60 are not specifically suited for AI. However, they will work, just not as efficiently as other GPUs. NVIDIA creates specific GPUs for specific workloads and industry (technological) areas of use, as each area has different requirements. (credits go to BJones)

Next,
if you are indeed willing to spend effort on this a-priori-known worst option à la carte:

make sure that both GPUs are in "Compute" mode, NOT "Graphics", if you're playing with AI. You can do that using the Linux Boot Utility you'll get with the correct M60 driver package after you've registered for the evaluation. (credits go again to BJones)

which obviously does not seem to be an option for non-Linux / Azure-operated virtualised-access devices.


Résumé:

If striving for increased performance and throughput, best choose another, AI / ML / DL / Tensor-processing-equipped GPU device, one into which the problem-specific computing hardware was actually put and on which there are no software layers (no GRID, or at least one with an easily available disable option) that would in any sense block achieving such advanced levels of GPU-processing performance.

Answered by user3666197


As the website says, the image size should be 600x600 and the code ran on an NVIDIA GeForce GTX TITAN X card. But first, please make sure your code is actually running on the GPU and not on the CPU. I suggest running your code and opening another window to watch the GPU utilisation with the command below, to see if anything changes.

watch nvidia-smi
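
To confirm the same thing from inside Python, one can also list the devices TensorFlow sees and enable device-placement logging. A minimal sketch using the TF 1.x API, where detection_graph stands for the graph object loaded in the demo notebook:

import tensorflow as tf
from tensorflow.python.client import device_lib

# a GPU should appear in this list as '/device:GPU:0'
print( device_lib.list_local_devices() )

# optionally, log where every op gets placed when the session runs
config = tf.ConfigProto( log_device_placement = True )
sess   = tf.Session( graph = detection_graph, config = config )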
Answered by macharya


TensorFlow takes a long time for the initial setup. (Don't worry, it is just a one-time process.)

Loading the graph is a heavy process. I executed this code on my CPU; it took almost 40 seconds to complete the program.

The time taken for the initial setup, i.e. loading the graph, was 37 seconds.

The actual time taken for performing object detection was 3 seconds, i.e. 1.5 seconds per image.

If I had given it 100 images, the total time taken would be 37 + 1.5 * 100 seconds, since the graph does not have to be loaded 100 times.

So in your case, if that took 25 [s], then the initial setup would have taken ~ 23-24 [s] and the actual inference should be ~ 1-2 [s].

You can verify it in the code. You may use the time module in Python:

import time                          # used to obtain time stamps

for image_path in TEST_IMAGE_PATHS:  # iteration of images for detection
    # ------------------------------ # processing of one image begins here
    start = time.time()              # saving current timestamp
    ...
    ...
    ...
    plt.imshow( image_np )
    # ------------------------------ # processing of one image ends here
    print( 'Time taken',
           time.time() - start       # time this image has taken
           )
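
If you also want to see the one-time graph-loading cost separately from the per-image sess.run cost, the same idea can be extended as below. This is only a sketch: PATH_TO_FROZEN_GRAPH and TEST_IMAGE_PATHS follow the demo notebook's conventions, and the output tensor names are the standard ones exported by the Object Detection API.

import time

import numpy as np
import tensorflow as tf
from PIL import Image

t0 = time.time()
detection_graph = tf.Graph()
with detection_graph.as_default():                    # one-time graph loading
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile( PATH_TO_FROZEN_GRAPH, 'rb' ) as fid:
        od_graph_def.ParseFromString( fid.read() )
        tf.import_graph_def( od_graph_def, name = '' )
print( 'Graph load time', time.time() - t0 )

with detection_graph.as_default(), tf.Session( graph = detection_graph ) as sess:
    image_tensor = detection_graph.get_tensor_by_name( 'image_tensor:0' )
    fetches      = [ detection_graph.get_tensor_by_name( name + ':0' )
                     for name in ( 'detection_boxes',   'detection_scores',
                                   'detection_classes', 'num_detections' ) ]

    for image_path in TEST_IMAGE_PATHS:
        image_np = np.array( Image.open( image_path ) )
        t1 = time.time()                              # per-image timing only
        sess.run( fetches,
                  feed_dict = { image_tensor: np.expand_dims( image_np, 0 ) } )
        print( image_path, 'sess.run time', time.time() - t1 )
        # NOTE: the very first sess.run is slower than the rest, as it also
        #       includes one-time GPU / cuDNN initialisation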
Answered by Sreeragh A R