Nvidia NPP nppiFilter produces garbage when convolving with 2d kernel

Nvidia Performance Primitives (NPP) provides the nppiFilter function for convolving a user-provided image with a user-provided kernel. For 1D convolution kernels, nppiFilter works properly. However, nppiFilter is producing a garbage image for 2D kernels.

I used the typical Lena test image as input (image omitted).


Here's my experiment with a 1D convolution kernel, which produces good output.

#include <npp.h>       // provided in the CUDA SDK
#include <ImagesCPU.h> // these image helper libraries are also in the CUDA SDK
#include <ImagesNPP.h>
#include <ImageIO.h>
#include <iostream>    // for std::cout

void test_nppiFilter()
{
    npp::ImageCPU_8u_C1 oHostSrc;
    npp::loadImage("Lena.pgm", oHostSrc);
    npp::ImageNPP_8u_C1 oDeviceSrc(oHostSrc); // malloc and memcpy to GPU 
    NppiSize kernelSize = {3, 1}; // dimensions of convolution kernel (filter)
    NppiSize oSizeROI = {oHostSrc.width() - kernelSize.width + 1, oHostSrc.height() - kernelSize.height + 1};
    npp::ImageNPP_8u_C1 oDeviceDst(oSizeROI.width, oSizeROI.height); // allocate device image of appropriately reduced size
    npp::ImageCPU_8u_C1 oHostDst(oDeviceDst.size());
    NppiPoint oAnchor = {2, 1}; // found that oAnchor = {2,1} or {3,1} works for kernel [-1 0 1] 
    NppStatus eStatusNPP;

    Npp32s hostKernel[3] = {-1, 0, 1}; // convolving with this should do edge detection
    Npp32s* deviceKernel;
    size_t deviceKernelPitch;
    cudaMallocPitch((void**)&deviceKernel, &deviceKernelPitch, kernelSize.width*sizeof(Npp32s), kernelSize.height*sizeof(Npp32s));
    cudaMemcpy2D(deviceKernel, deviceKernelPitch, hostKernel,
                     sizeof(Npp32s)*kernelSize.width, // sPitch
                     sizeof(Npp32s)*kernelSize.width, // width
                     kernelSize.height, // height
                     cudaMemcpyHostToDevice);
    Npp32s divisor = 1; // no scaling

    eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                                          oDeviceDst.data(), oDeviceDst.pitch(),
                                          oSizeROI, deviceKernel, kernelSize, oAnchor, divisor);

    std::cout << "nppiFilter error status " << eStatusNPP << std::endl; // prints 0 (no errors)
    oDeviceDst.copyTo(oHostDst.data(), oHostDst.pitch()); // memcpy to host
    npp::saveImage("Lena_filter_1d.pgm", oHostDst);
}

Output of the above code with kernel [-1 0 1] looks like a reasonable gradient image (image omitted).


However, nppiFilter outputs a garbage image if I use a 2D convolution kernel. Here are the things that I changed from the above code to run with the 2D kernel [-1 0 1; -1 0 1; -1 0 1]:

NppiSize kernelSize = {3, 3};
Npp32s hostKernel[9] = {-1, 0, 1, -1, 0, 1, -1, 0, 1};
NppiPoint oAnchor = {2, 2}; // note: using anchor {1,1} or {0,0} causes error -24 (NPP_TEXTURE_BIND_ERROR)
npp::saveImage("Lena_filter_2d.pgm", oHostDst);

Below is the output image using the 2D kernel [-1 0 1; -1 0 1; -1 0 1] (image omitted; the output is visibly garbled).

What am I doing wrong?

This StackOverflow post describes a similar problem, as shown in user Steenstrup's image: http://1ordrup.dk/kasper/image/Lena_boxFilter5.jpg


A few final notes:

  • With the 2D kernel, for certain anchor values (e.g. NppiPoint oAnchor = {0, 0} or {1, 1}), I get error -24, which translates to NPP_TEXTURE_BIND_ERROR according to the NPP User Guide. This issue was mentioned briefly in this StackOverflow post.
  • This code is very verbose. This isn't the main question, but does anyone have any suggestions for how to make this code more concise?
asked Nov 04 '22 by solvingPuzzles

1 Answer

You are using a 2D (pitched) memory allocator for the kernel array. Kernel arrays are dense 1D arrays, not 2D strided arrays like typical NPP images.

Simply replace the cudaMallocPitch call with a plain cudaMalloc of size kernelSize.width * kernelSize.height * sizeof(Npp32s), and use a plain cudaMemcpy instead of cudaMemcpy2D:

// dense 1D allocation and copy instead of pitched 2D
cudaMalloc((void**)&deviceKernel, kernelSize.width * kernelSize.height * sizeof(Npp32s));
cudaMemcpy(deviceKernel, hostKernel, kernelSize.width * kernelSize.height * sizeof(Npp32s), cudaMemcpyHostToDevice);

As an aside, a "scale factor" of 1 does not translate to no scaling. Scaling happens with factors 2^(-ScaleFactor), so a scale factor of 0 means no scaling.

answered Nov 08 '22 by Frank Jargstorff