I am running the following piece of code:
cv::Ptr<cv::FastFeatureDetector> fastDetector = cv::FastFeatureDetector::create(100, true, 2);
cv::Ptr<cv::cuda::FastFeatureDetector> gpuFastDetector = cv::cuda::FastFeatureDetector::create(100, true, 2);
std::vector<cv::KeyPoint> keypoints;
std::vector<cv::KeyPoint> gpuKeypoints;
cv::Mat frame;
cv::cuda::GpuMat gFrame;
frame = cv::imread("image1.jpg"); // 4608 x 3456
cv::cvtColor(frame, frame, cv::COLOR_BGR2GRAY); // CV_BGR2GRAY is the deprecated C-style name
gFrame.upload(frame);
gpuFastDetector->detect(gFrame, gpuKeypoints);
std::cout << "FAST GPU " << gpuKeypoints.size() << std::endl;
fastDetector->detect(frame, keypoints);
std::cout << "FAST " << keypoints.size() << std::endl;
And the output is:
FAST GPU 2210
FAST 3209
Question 1
Why does the same algorithm applied to the same image with the same parameters result in a different number of detected keypoints?
Question 2
I am running this on Windows in Visual Studio. When using the Debug configuration, the GPU detection performs faster.
But when using Release, the normal (CPU) FAST detector performs faster. Moreover, the detector's performance on the GPU remains the same regardless of the configuration used, while CPU performance increases sharply in Release compared to Debug.
(I am not running measurements on the code I presented here. I am aware that the first call to some OpenCV functions can take longer to execute because of context initialization.)
This is very likely related to my old question about the FAST detector. A plausible explanation was given by BHawk about SIMD optimizations on CPU.
So, the second question is:
Is it possible that the SIMD optimized CPU can perform FAST feature detection faster than the GPU? This seems highly unlikely.
Initialize long-winded answer :)
Question 1:
A debug compilation doesn't use the code optimizations of a release build. The debug version does things like retain temporary variable data so that you can inspect it in the debugger, which often means data that would normally live briefly in CPU registers spills into RAM. An optimized Release build simply discards that data once it is no longer needed. This difference might go away if you disable code optimization in your compiler settings; I'm not sure, as I've never tried compiling without optimization.
Question 2:
There are a few factors at play when determining whether an image process will perform better on a GPU or CPU.
1: Memory Management
The major bottleneck with GPU processing is loading data onto the GPU and retrieving it from the GPU. For very large images (16 megapixels in your case) this can become a significant impediment. GPUs work best when you load images onto them once and then leave them there to be manipulated and displayed via an OpenGL context (as you would see in a 3D gaming engine).
2: Serial versus parallel
GPUs are made up of thousands of small processing cores that run in parallel. As such, they can perform lots of small tasks simultaneously. CPUs, on the other hand, are optimized to perform complex tasks in serial. This means some tasks (large image context, complex calculations, multi-step processes) will likely perform better on a CPU than on a GPU, while simpler tasks that use small image contexts and don't require multiple processing steps will perform much faster on a GPU.
To further complicate matters, CPUs can run threads in parallel depending on the number of compute cores available, and SIMD-optimized CPUs can parallelize their processing further still. So a single CPU with 4 cores, each with 8-wide SIMD ALUs, can process 32 pieces of data simultaneously. This is still a far cry from the thousands of cores in a GPU, but CPU cores usually clock much faster, so 4 cores with 8-wide SIMD may outperform the GPU on certain tasks. Of course, CPU speed also scales with the hardware: it increases with more cores or more ALUs, and decreases with fewer.
Conclusions
Because of the memory bottleneck, some image processing tasks aren't well suited to the GPU: the data I/O negates any speed gain from the massive parallelization. Where you have a highly optimized, parallelized SIMD CPU algorithm, it is certainly possible for the CPU version to outperform the GPU, due to the nature of the algorithm, the cost of moving data onto and off of the GPU, or both. You might also find that on small images the GPU version is still slightly faster.
I would have to read through the source to see exactly how and why this specific function runs faster on the CPU than on the GPU, but I am not surprised that it does. As for why the two implementations detect a different number of features, that would also require a read-through, but it is probably because each implementation was adapted differently for memory-allocation or optimization purposes.
Sorry for the long answer, but it is a complicated topic of discussion.