I developed a naive function for mirroring an image horizontally or vertically using CUDA C++.
I then learned that the NVIDIA Performance Primitives (NPP) library also offers a function for image mirroring.
Just for the sake of comparison, I timed my function against NPP. Surprisingly, my function outperformed the NPP one (by a small margin, but still...).
I confirmed the results several times using both the Windows timer and the CUDA timer.
My question is: aren't NPP functions completely optimized for NVIDIA GPUs?
I'm using CUDA 5.0, a GeForce GTX 460M (compute capability 2.1), and Windows 8 for development.
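For context, a naive mirror of the kind described above might look roughly like the sketch below. It is not the code that was actually timed, and it makes several assumptions: a single-channel 8-bit image in row-major order, a pitch measured in elements, and one thread per destination pixel. The names `MirrorFlip`, `mirrorSourceIndex`, `mirrorHost`, and `mirrorKernel` are hypothetical and do not come from NPP. The kernel is guarded by `__CUDACC__` so the same file also builds as plain C++, which lets the host reference be used to check the GPU output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE  // plain C++ build: no CUDA qualifiers needed
#endif

// Hypothetical flip modes (not NPP identifiers).
enum MirrorFlip { FLIP_LEFT_RIGHT, FLIP_TOP_BOTTOM };

// Index of the source pixel that lands at destination (x, y).
// FLIP_LEFT_RIGHT reverses columns; FLIP_TOP_BOTTOM reverses rows.
// pitch is the row stride in elements (assumed >= width).
HOST_DEVICE inline std::size_t mirrorSourceIndex(int x, int y, int width, int height,
                                                 std::size_t pitch, MirrorFlip flip)
{
    int sx = (flip == FLIP_LEFT_RIGHT) ? (width - 1 - x) : x;
    int sy = (flip == FLIP_TOP_BOTTOM) ? (height - 1 - y) : y;
    return static_cast<std::size_t>(sy) * pitch + static_cast<std::size_t>(sx);
}

// CPU reference on a tightly packed image (pitch == width).
std::vector<unsigned char> mirrorHost(const std::vector<unsigned char>& src,
                                      int width, int height, MirrorFlip flip)
{
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<unsigned char> dst(src.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[static_cast<std::size_t>(y) * width + x] =
                src[mirrorSourceIndex(x, y, width, height, width, flip)];
    return dst;
}

#ifdef __CUDACC__
// Naive kernel: each thread writes one destination pixel.
// Writes are coalesced along x; reads run backwards along x for FLIP_LEFT_RIGHT.
__global__ void mirrorKernel(const unsigned char* src, unsigned char* dst,
                             int width, int height, std::size_t pitch, MirrorFlip flip)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)  // guard threads that fall outside the image
        dst[static_cast<std::size_t>(y) * pitch + x] =
            src[mirrorSourceIndex(x, y, width, height, pitch, flip)];
}
#endif
```

When built with `nvcc`, the kernel would typically be launched on a 2D grid that covers the image, for example 16x16 thread blocks with the grid dimensions rounded up. That block size is only a common starting point, not a tuned value.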
I risk getting no votes by posting this answer. :)
NVIDIA continuously works to improve all of our CUDA libraries. NPP is a particularly large library, with 4000+ functions to maintain. We have a realistic goal of providing libraries with a useful speedup over a CPU equivalent, that are tested on all of our GPUs and supported OSes, and that are actively improved and maintained. The function in question (Mirror) is a known performance issue that we will improve in a future release. If you need a particular function optimized, the best way to get it prioritized is to file an RFE (Request for Enhancement) bug using the bug submission form that is available to NVIDIA CUDA registered developers.
As an aside, I don't think any library can ever be "fully optimized". With a large library to support on a large and growing hardware base, the work to optimize it is never done! :)
We encourage folks to continue to try and outdo NVIDIA libraries, because overall it advances the state of the art and benefits the computing ecosystem.