I've been struggling to get OpenCV's CUDA support to improve performance for operations such as erode/dilate and frame differencing when I read a video from an AVI file. Typically I get half the FPS on the GPU (GTX 580) compared to the CPU (AMD 955BE). Before you ask whether I'm measuring FPS correctly: the lag on the GPU is clearly visible to the naked eye, especially at high erode/dilate levels.
It seems that I'm not reading the frames in parallel? Here is the code:
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/video/tracking.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <stdlib.h>
#include <stdio.h>

using namespace cv;
using namespace cv::gpu;

Mat cpuSrc;
GpuMat src, dst;

int element_shape = MORPH_RECT;

// the address of variable which receives trackbar position update
int max_iters = 10;
int open_close_pos = 0;
int erode_dilate_pos = 0;

// callback function for open/close trackbar
void OpenClose(int)
{
    IplImage disp;
    Mat temp;
    int n = open_close_pos - max_iters;
    int an = n > 0 ? n : -n;
    Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
    if (n < 0)
        cv::gpu::morphologyEx(src, dst, CV_MOP_OPEN, element);
    else
        cv::gpu::morphologyEx(src, dst, CV_MOP_CLOSE, element);
    dst.download(temp);
    disp = temp;
    // cvShowImage("Open/Close", &disp);
}

// callback function for erode/dilate trackbar
void ErodeDilate(int)
{
    IplImage disp;
    Mat temp;
    int n = erode_dilate_pos - max_iters;
    int an = n > 0 ? n : -n;
    Mat element = getStructuringElement(element_shape, Size(an*2+1, an*2+1), Point(an, an));
    if (n < 0)
        cv::gpu::erode(src, dst, element);
    else
        cv::gpu::dilate(src, dst, element);
    dst.download(temp);                 // copy the result back to the host for display
    disp = temp;
    cvShowImage("Erode/Dilate", &disp);
}

int main(int argc, char** argv)
{
    VideoCapture capture("TwoManLoiter.avi");

    // create windows for output images
    namedWindow("Open/Close", 1);
    namedWindow("Erode/Dilate", 1);

    open_close_pos = 3;
    erode_dilate_pos = 0;
    createTrackbar("iterations", "Open/Close", &open_close_pos, max_iters*2+1, NULL);
    createTrackbar("iterations", "Erode/Dilate", &erode_dilate_pos, max_iters*2+1, NULL);

    for (;;)
    {
        capture >> cpuSrc;
        if (cpuSrc.empty())             // stop when the video runs out of frames
            break;

        src.upload(cpuSrc);             // copy the frame to the GPU
        GpuMat grey;
        cv::gpu::cvtColor(src, grey, CV_BGR2GRAY);
        src = grey;

        ErodeDilate(erode_dilate_pos);

        int c = cvWaitKey(25);
        if ((char)c == 27)
            break;
    }
    return 0;
}
The CPU implementation is the same, minus the cv::gpu namespace and using Mat instead of GpuMat, of course.
Thanks
OpenCV includes a GPU module that contains all of the GPU-accelerated functionality.
By default, each of the OpenCV CUDA algorithms uses a single GPU. If you need to utilize multiple GPUs, you have to manually distribute the work between GPUs.
Building OpenCV without CUDA support does not perform device code compilation, so it does not require the CUDA Toolkit installed.
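As a side note, here is a minimal sketch (not from the question; it assumes the OpenCV 2.x cv::gpu API used above) of how you can check at runtime whether a CUDA device is available and select one. getCudaEnabledDeviceCount() returns 0 both when no device is present and when OpenCV was built without CUDA support:

#include <opencv2/gpu/gpu.hpp>
#include <cstdio>

int main()
{
    // 0 means no CUDA device found, or OpenCV was built without CUDA support
    int devices = cv::gpu::getCudaEnabledDeviceCount();
    if (devices == 0)
    {
        std::printf("No CUDA-enabled device available.\n");
        return 1;
    }
    cv::gpu::setDevice(0);                  // cv::gpu calls in this thread use device 0
    cv::gpu::DeviceInfo info(0);
    std::printf("Found %d device(s), using %s\n", devices, info.name().c_str());
    return 0;
}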
My guess would be that the performance gain from the GPU erode/dilate is outweighed by the memory operations of transferring the image to and from the GPU every frame. Keep in mind that memory bandwidth is a crucial factor in GPGPU algorithms, and even more so the bandwidth between the CPU and the GPU.
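One way to check that guess is to time the upload, the kernel and the download separately. A rough sketch (not the poster's code; it uses the 2.x cv::gpu API and assumes an 8-bit single-channel frame, as in the question's grayscale conversion):

#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>
using namespace cv;

void timeOneFrame(const Mat& grey, const Mat& element)
{
    gpu::GpuMat d_src, d_dst;
    const double freq = getTickFrequency();

    int64 t0 = getTickCount();
    d_src.upload(grey);                 // host -> device copy over PCIe
    int64 t1 = getTickCount();
    gpu::erode(d_src, d_dst, element);  // the actual GPU morphology
    int64 t2 = getTickCount();
    Mat result;
    d_dst.download(result);             // device -> host copy over PCIe
    int64 t3 = getTickCount();

    printf("upload %.2f ms, erode %.2f ms, download %.2f ms\n",
           (t1 - t0) * 1000.0 / freq,
           (t2 - t1) * 1000.0 / freq,
           (t3 - t2) * 1000.0 / freq);
}

If the two copy times dominate the kernel time, the transfers are the bottleneck, and the win comes from keeping data resident on the GPU across operations rather than from faster kernels.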
EDIT: To optimize this, you could write your own image display routine (instead of cvShowImage) that uses OpenGL and displays the image as an OpenGL texture. In that case you don't need to read the processed image from the GPU back to the CPU, and you can use an OpenGL texture/buffer directly as a CUDA image/buffer, so the image never has to be copied off the GPU. You might have to manage the CUDA resources yourself, though. With this method you could also use PBOs to upload the video into the texture and benefit a little from asynchronous transfers.
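In OpenCV 2.4 you may not even need a hand-written OpenGL routine: if OpenCV was built with OpenGL support (WITH_OPENGL=ON), a window created with the CV_WINDOW_OPENGL flag can display a GpuMat directly, so the download()/cvShowImage() round trip disappears. A hedged sketch of that idea (not the poster's code; element size is arbitrary):

#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;

int main()
{
    gpu::setGlDevice(0);                        // bind CUDA device 0 to the OpenGL context
    namedWindow("Erode/Dilate", CV_WINDOW_OPENGL);

    VideoCapture capture("TwoManLoiter.avi");
    Mat frame;
    gpu::GpuMat d_frame, d_grey, d_dst;
    Mat element = getStructuringElement(MORPH_RECT, Size(5, 5));

    for (;;)
    {
        capture >> frame;
        if (frame.empty())
            break;
        d_frame.upload(frame);                  // the only host -> device copy per frame
        gpu::cvtColor(d_frame, d_grey, CV_BGR2GRAY);
        gpu::erode(d_grey, d_dst, element);
        imshow("Erode/Dilate", d_dst);          // rendered as an OpenGL texture, no download
        if (waitKey(25) == 27)
            break;
    }
    return 0;
}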