I have trained a faster_rcnn_inception_resnet_v2_atrous_coco model (available here) for custom object detection.
For prediction, I used the object detection demo Jupyter notebook on my images. I also checked the time consumed at each step and found that sess.run was taking all the time.
But it takes around 25-40 [sec] to predict a single image of (3000 x 2000) pixel size (around 1-2 [MB]) on the GPU.
Can anyone figure out the problem here?
I have performed profiling - link to download the profiling file
Link to full profiling
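For reference, the kind of per-op trace linked above can be collected with the TF 1.x session API roughly as sketched below; sess, tensor_dict and image_np_expanded are the usual names from the object detection demo notebook and are assumptions here, not my exact code.

import tensorflow as tf
from tensorflow.python.client import timeline

# assumed to already exist, as in the object detection demo notebook:
#   sess, tensor_dict, image_np_expanded
run_options  = tf.RunOptions( trace_level = tf.RunOptions.FULL_TRACE )
run_metadata = tf.RunMetadata()

output_dict = sess.run( tensor_dict,
                        feed_dict    = { 'image_tensor:0': image_np_expanded },
                        options      = run_options,
                        run_metadata = run_metadata
                        )

# write a Chrome-trace file, viewable in chrome://tracing
trace = timeline.Timeline( step_stats = run_metadata.step_stats )
with open( 'sess_run_timeline.json', 'w' ) as f:
    f.write( trace.generate_chrome_trace_format() )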
System information:
Training and prediction were done on a Virtual Machine created in the Azure portal with the Standard_NV6 size (details here), which uses an NVIDIA Tesla M60 GPU
pip3 install --upgrade tensorflow-gpu
Can anyone figure out the problem here?
One could not find a worse VM setup in the Azure portfolio for such a compute-intensive (performance-and-throughput motivated) task. One simply could not - there is no less-equipped option for this on the menu.
The Azure NV6 is explicitly marketed for the benefit of Virtual Desktop users, where the NVIDIA GRID(R) driver delivers a software layer of services for "sharing" parts of an also-virtualised FrameBuffer for image/video (desktop graphics pixels, at most SP en-/decoders) among teams of users, irrespective of their terminal device (yet at most 15 users per each of the two on-board GPUs, which is precisely what it was explicitly advertised and promoted for on Azure as its Key Selling Point; NVIDIA goes even a step further, promoting this device explicitly for (cit.) Office Users).
The M60 lacks (obviously, having been designed for a very different market segment) any smart AI / ML / DL / Tensor-processing features, and has ~ 20x lower DP performance than GPU devices specialised for AI / ML / DL / Tensor-processing computing.
If I may cite,
... "GRID" is the software component that lays over a given set of Tesla ( Currently M10, M6, M60 ) (and previously Quadro (K1 / K2)) GPUs. In its most basic form (if you can call it that), the GRID software is currently for creating FrameBuffer profiles when using the GPUs in "Graphics" mode, which allows users to share a portion of the GPUs FrameBuffer whilst accessing the same physical GPU.
and
No, the M10, M6 and M60 are not specifically suited for AI. However, they will work, just not as efficiently as other GPUs. NVIDIA creates specific GPUs for specific workloads and industry (technological) areas of use, as each area has different requirements.( credits go to BJones )
Next,
if you are indeed willing to spend effort on this a-priori known worst option à la carte:
make sure that both GPUs are in "Compute" mode, NOT "Graphics" if you're playing with AI. You can do that using the Linux Boot Utility you'll get with the correct M60 driver package after you've registered for the evaluation. ( credits go again to BJones )
which obviously does not seem to be an option on non-Linux / Azure-operated, virtualised-access devices.
If striving for increased performance and throughput, it is best to choose another, AI / ML / DL / Tensor-processing equipped GPU device, where the problem-specific computing hardware resources are actually present and where there are no software layers (no GRID, or at least an easily available disable option) that would in any sense block reaching such advanced levels of GPU-processing performance.
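For completeness, on a bare-metal Linux host the "Graphics"-to-"Compute" switch mentioned above is typically done with NVIDIA's gpumodeswitch utility, shipped with the M60 driver / evaluation package - a sketch under that assumption only, since, as noted, no such option appears to be exposed on the Azure-managed, virtualised device:

# assumes NVIDIA's gpumodeswitch utility on a bare-metal Linux host (not the Azure VM above)
sudo gpumodeswitch --listgpumodes       # show the current mode of each on-board GPU
sudo gpumodeswitch --gpumode compute    # switch the board from "Graphics" to "Compute" mode
sudo reboot                             # the mode change takes effect after a reboot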
As the website says, the image size should be 600x600, and the code ran on an Nvidia GeForce GTX TITAN X card. But first, please make sure your code is actually running on the GPU and not on the CPU. I suggest running your code and opening another window to watch the GPU utilization using the command below, and seeing whether anything changes.
watch nvidia-smi
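Additionally, you can confirm from inside TensorFlow itself that the graph actually lands on the GPU - a small sketch using standard TF 1.x calls, not taken from the question's code:

import tensorflow as tf
from tensorflow.python.client import device_lib

print( tf.test.gpu_device_name() )        # e.g. '/device:GPU:0' when a GPU is visible
print( device_lib.list_local_devices() )  # lists the CPU and any GPUs TensorFlow can use

# logging device placement prints which device every op is actually assigned to
sess = tf.Session( config = tf.ConfigProto( log_device_placement = True ) )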
TensorFlow takes a long time for the initial setup. (Don't worry, it is just a one-time process.)
Loading the graph is a heavy process. I executed this code on my CPU, and it took almost 40 seconds to complete the program.
The time taken for the initial setup, i.e. loading the graph, was 37 seconds.
The actual time taken for performing object detection was 3 seconds, i.e. 1.5 seconds per image.
If I had given it 100 images, the total time taken would be 37 + 1.5 * 100 seconds, because the graph does not have to be loaded 100 times.
So in your case, if that took 25 [s], then the initial setup would have taken ~ 23-24 [s]. The actual time should be ~ 1-2 [s].
You can verify this in the code. You may use the time module in Python:
import time                          # used to obtain time stamps

for image_path in TEST_IMAGE_PATHS:  # iteration over images for detection
    # ------------------------------ # processing one image begins here
    start = time.time()              # saving the current timestamp
    ...
    ...
    ...
    plt.imshow( image_np )
    # ------------------------------ # processing one image ends here
    print( 'Time taken',
           time.time() - start       # calculating the time it has taken
           )
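The one-off cost stays one-off only if the graph and the session are created once, outside the image loop, as the object detection demo notebook already does. A minimal TF 1.x sketch of that structure - PATH_TO_FROZEN_GRAPH and TEST_IMAGE_PATHS are the usual notebook names, assumed here:

import tensorflow as tf

# one-time, expensive setup: load the frozen graph and create the session once
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile( PATH_TO_FROZEN_GRAPH, 'rb' ) as f:  # assumed path to frozen_inference_graph.pb
        graph_def.ParseFromString( f.read() )
    tf.import_graph_def( graph_def, name = '' )

with tf.Session( graph = detection_graph ) as sess:
    for image_path in TEST_IMAGE_PATHS:  # cheap, per-image part: only sess.run(...) repeats
        ...                              # load the image, feed it, and call sess.run(...)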