 

Bilinear interpolation in C/C++ and CUDA

Tags: c++, cuda

I want to emulate the behavior of CUDA bilinear interpolation on the CPU, but I found that the value returned by tex2D does not seem to fit the bilinear formula.

I guess that casting the interpolation coefficients from float to a 9-bit fixed-point format with 8 bits of fractional value [1] results in different values.

According to the conversion formula [2, line 106], the result of the conversion will be the same as the input float when the coefficient is 1/2^n, with n = 0, 1, ..., 8, but I still (though not always) get unexpected values.
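
Roughly, the conversion I have in mind looks like the sketch below (a hypothetical helper; whether the hardware rounds or truncates at this step is part of what I am unsure about):

// Sketch of the float -> 9-bit fixed point (8 fractional bits) conversion of
// an interpolation coefficient in [0, 1]; rounding to the nearest 1/256 step
// is my assumption, the hardware may truncate instead.
float to_fixed_1_8(float coeff)
{
    const int fixed = (int)(coeff * 256.0f + 0.5f);  // scale by 2^8 and round
    return (float)fixed / 256.0f;                    // back to float for the CPU formula
}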

Below I report an example of the unexpected values. In this case, they always occur when id = 2*n + 1; could anyone tell me why?

Src Array:

Src[0][0] =  38;  
Src[1][0] =  39;  
Src[0][1] = 118;  
Src[1][1] =  13;  

Texture Definition:

static texture<float4, 2, cudaReadModeElementType> texElnt;
texElnt.addressMode[0] = cudaAddressModeClamp;
texElnt.addressMode[1] = cudaAddressModeClamp;
texElnt.filterMode = cudaFilterModeLinear;
texElnt.normalized = false;

Kernel Function:

static __global__ void kernel_texElnt(float* pdata, int w, int h, int c, float stride/*0.03125f*/) {
    const int gx = blockIdx.x*blockDim.x + threadIdx.x;
    const int gy = blockIdx.y*blockDim.y + threadIdx.y;
    const int gw = gridDim.x * blockDim.x;
    const int gid = gy*gw + gx;
    if (gx >= w || gy >= h) {
        return;
    }

    float2 pnt;
    pnt.x = (gx)*(stride)/*1/32*/;
    pnt.y = 0.0625f/*1/16*/;

    float4 result = tex2D( texElnt, pnt.x + 0.5, pnt.y + 0.5f);
    pdata[gid*3 + 0] = pnt.x;
    pdata[gid*3 + 1] = pnt.y;
    pdata[gid*3 + 2] = result.x;

}

Bilinear Result of CUDA

id  pnt.x   pnt.y   tex2D
 0  0.00000 0.0625  43.0000000
 1  0.03125 0.0625  42.6171875
 2  0.06250 0.0625  42.6484375
 3  0.09375 0.0625  42.2656250
 4  0.12500 0.0625  42.2968750
 5  0.15625 0.0625  41.9140625
 6  0.18750 0.0625  41.9453125
 7  0.21875 0.0625  41.5625000
 8  0.25000 0.0625  41.5937500
 9  0.28125 0.0625  41.2109375
10  0.31250 0.0625  41.2421875
11  0.34375 0.0625  40.8593750
12  0.37500 0.0625  40.8906250
13  0.40625 0.0625  40.5078125
14  0.43750 0.0625  40.5390625
15  0.46875 0.0625  40.1562500
16  0.50000 0.0625  40.1875000
17  0.53125 0.0625  39.8046875
18  0.56250 0.0625  39.8359375
19  0.59375 0.0625  39.4531250
20  0.62500 0.0625  39.4843750
21  0.65625 0.0625  39.1015625
22  0.68750 0.0625  39.1328125
23  0.71875 0.0625  38.7500000
24  0.75000 0.0625  38.7812500
25  0.78125 0.0625  38.3984375
26  0.81250 0.0625  38.4296875
27  0.84375 0.0625  38.0468750
28  0.87500 0.0625  38.0781250
29  0.90625 0.0625  37.6953125
30  0.93750 0.0625  37.7265625
31  0.96875 0.0625  37.3437500
32  1.00000 0.0625  37.3750000

CPU Result:

// convert coefficient ((1-α)*(1-β)), (α*(1-β)), ((1-α)*β), (α*β) to fixed point format  

id  pnt.x   pnt.y   CPU
 0  0.00000 0.0625  43.00000000
 1  0.03125 0.0625  43.23046875
 2  0.06250 0.0625  42.64843750
 3  0.09375 0.0625  42.87890625
 4  0.12500 0.0625  42.29687500
 5  0.15625 0.0625  42.52734375
 6  0.18750 0.0625  41.94531250
 7  0.21875 0.0625  42.17578125
 8  0.25000 0.0625  41.59375000
 9  0.28125 0.0625  41.82421875
10  0.31250 0.0625  41.24218750
11  0.34375 0.0625  41.47265625
12  0.37500 0.0625  40.89062500
13  0.40625 0.0625  41.12109375
14  0.43750 0.0625  40.53906250
15  0.46875 0.0625  40.76953125
16  0.50000 0.0625  40.18750000
17  0.53125 0.0625  40.41796875
18  0.56250 0.0625  39.83593750
19  0.59375 0.0625  40.06640625
20  0.62500 0.0625  39.48437500
21  0.65625 0.0625  39.71484375
22  0.68750 0.0625  39.13281250
23  0.71875 0.0625  39.36328125
24  0.75000 0.0625  38.78125000
25  0.78125 0.0625  39.01171875
26  0.81250 0.0625  38.42968750
27  0.84375 0.0625  38.66015625
28  0.87500 0.0625  38.07812500
29  0.90625 0.0625  38.30859375
30  0.93750 0.0625  37.72656250
31  0.96875 0.0625  37.95703125
32  1.00000 0.0625  37.37500000

I leave simple code on my GitHub [3]; after running the program you will get two files in D:\.

Edit 2014/01/20

Running the program with different increments, I found the following behavior of tex2D: when alpha multiplied by beta is less than 0.00390625 (that is, 1/256 = 2^-8, the resolution of the 8-bit fractional part), the value returned by tex2D does not match the bilinear interpolation formula.

asked Jan 15 '14 by user1995868

2 Answers

Satisfactory answers have already been provided to this question, so here I just want to give a compendium of hopefully useful information on bilinear interpolation, how it can be implemented in C++, and the different ways it can be done in CUDA.

Maths behind bilinear interpolation

Assume that the original function T(x, y) is sampled on a Cartesian regular grid of points (i, j), with 0 <= i < M1, 0 <= j < M2 and i and j integers. For each value of y, one can first use 0 <= a < 1 to represent an arbitrary point i + a between i and i + 1. Then, a linear interpolation along the line y = j (which is parallel to the x axis) at that point can be performed, obtaining

r(i + a, j) = T(i, j) + a * [T(i + 1, j) - T(i, j)]

where r(x,y) is the function interpolating the samples of T(x,y). The same can be done for the line y = j + 1, obtaining

r(i + a, j + 1) = T(i, j + 1) + a * [T(i + 1, j + 1) - T(i, j + 1)]

Now, for each i + a, an interpolation along the y axis can be performed on the samples r(i+a, j) and r(i+a, j+1). Accordingly, if one uses 0 <= b < 1 to represent an arbitrary point j + b located between j and j + 1, then a linear interpolation along the line x = i + a (which is parallel to the y axis) can be worked out, obtaining the final result

r(i + a, j + b) = r(i + a, j) + b * [r(i + a, j + 1) - r(i + a, j)]
                = (1 - a)(1 - b) T(i, j) + a (1 - b) T(i + 1, j) + (1 - a) b T(i, j + 1) + a b T(i + 1, j + 1)

Note that the relations between i, j, a, b, x and y are the following

i = floor(x)    a = x - i    j = floor(y)    b = y - j
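
Before moving to the full routines, here is a minimal scalar sketch of the formulas above (a hypothetical helper: single precision, unit grid spacing, row-major storage, no boundary handling):

#include <cmath>

// r(x, y) for a single-channel image T of width M1, following the formulas above.
float bilinear(const float *T, const int M1, const float x, const float y)
{
    const int   i = (int)std::floor(x);
    const int   j = (int)std::floor(y);
    const float a = x - (float)i;
    const float b = y - (float)j;
    const float r0 = (1.f - a) * T[ j      * M1 + i] + a * T[ j      * M1 + i + 1]; // along the line y = j
    const float r1 = (1.f - a) * T[(j + 1) * M1 + i] + a * T[(j + 1) * M1 + i + 1]; // along the line y = j + 1
    return (1.f - b) * r0 + b * r1;                                                 // along the line x = i + a
}

As a quick check against the question's data (assuming Src[x][y] is stored row-major along x, i.e. T = {38, 39, 118, 13} with M1 = 2), bilinear(T, 2, 0.f, 0.0625f) gives (1 - 0.0625)*38 + 0.0625*118 = 43, matching the first row of both tables in the question.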

C/C++ implementation

Let me stress that this implementation, as well as the following CUDA ones, assumes, as stated at the beginning, that the samples of T are located on the Cartesian regular grid of points (i, j) with 0 <= i < M1, 0 <= j < M2 and i and j integers (unit spacing). Also, the routine is given in single-precision, complex (float2) arithmetic, but it can easily be recast to other arithmetic of interest.

void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, float2 * __restrict__ h_data, 
                                         float * __restrict__ h_xout, float * __restrict__ h_yout, 
                                         const int M1, const int M2, const int N1, const int N2){

    float2 result_temp1, result_temp2;
    for(int k=0; k<N2; k++){
        for(int l=0; l<N1; l++){

            const int   ind_x = floor(h_xout[k*N1+l]); 
            const float a     = h_xout[k*N1+l]-ind_x; 

            const int   ind_y = floor(h_yout[k*N1+l]); 
            const float b     = h_yout[k*N1+l]-ind_y; 

            float2 h00, h01, h10, h11;
            if (((ind_x)   < M1)&&((ind_y)   < M2)) h00 = h_data[ind_y*M1+ind_x];       else    h00 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1)&&((ind_y)   < M2)) h10 = h_data[ind_y*M1+ind_x+1];     else    h10 = make_float2(0.f, 0.f);
            if (((ind_x)   < M1)&&((ind_y+1) < M2)) h01 = h_data[(ind_y+1)*M1+ind_x];   else    h01 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1)&&((ind_y+1) < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else    h11 = make_float2(0.f, 0.f);

            result_temp1.x = a * h10.x + (-h00.x * a + h00.x); 
            result_temp1.y = a * h10.y + (-h00.y * a + h00.y);

            result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
            result_temp2.y = a * h11.y + (-h01.y * a + h01.y);

            h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
            h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

        }   
    }
}

The if/else statements in the above code are simply boundary checks. If a sample falls outside the [0, M1-1] x [0, M2-1] region, it is set to 0.
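
A minimal usage sketch (hypothetical test harness, reusing the 2x2 grid from the question; float2 and make_float2 come from the CUDA headers, so include cuda_runtime.h even for this host-only test):

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int M1 = 2, M2 = 2;     // input grid size
    const int N1 = 2, N2 = 1;     // number of output points
    // Src[x][y] from the question, stored row-major along x; imaginary parts set to 0
    std::vector<float2> h_data   = { {38.f, 0.f}, {39.f, 0.f}, {118.f, 0.f}, {13.f, 0.f} };
    std::vector<float>  h_xout   = { 0.f, 1.f };
    std::vector<float>  h_yout   = { 0.0625f, 0.0625f };
    std::vector<float2> h_result(N1 * N2);

    bilinear_interpolation_function_CPU(h_result.data(), h_data.data(),
                                        h_xout.data(), h_yout.data(), M1, M2, N1, N2);

    // Expected: about 43.0 and 37.375, i.e. the first and last rows of the question's tables
    printf("%f %f\n", h_result[0].x, h_result[1].x);
    return 0;
}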

Standard CUDA implementation

This is a "standard" CUDA implementation mirroring the CPU one above. No use of texture memory.

__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data, 
                                                  const float * __restrict__ d_xout, const float * __restrict__ d_yout, 
                                                  const int M1, const int M2, const int N1, const int N2)
{
   const int l = threadIdx.x + blockDim.x * blockIdx.x;
   const int k = threadIdx.y + blockDim.y * blockIdx.y;

   if ((l<N1)&&(k<N2)) {

       float2 result_temp1, result_temp2;

       const int    ind_x = floor(d_xout[k*N1+l]); 
       const float  a     = d_xout[k*N1+l]-ind_x; 

       const int    ind_y = floor(d_yout[k*N1+l]); 
       const float  b     = d_yout[k*N1+l]-ind_y; 

       float2 d00, d01, d10, d11;
       if (((ind_x)   < M1)&&((ind_y)   < M2))  d00 = d_data[ind_y*M1+ind_x];       else    d00 = make_float2(0.f, 0.f);
       if (((ind_x+1) < M1)&&((ind_y)   < M2))  d10 = d_data[ind_y*M1+ind_x+1];     else    d10 = make_float2(0.f, 0.f);
       if (((ind_x)   < M1)&&((ind_y+1) < M2))  d01 = d_data[(ind_y+1)*M1+ind_x];   else    d01 = make_float2(0.f, 0.f);
       if (((ind_x+1) < M1)&&((ind_y+1) < M2))  d11 = d_data[(ind_y+1)*M1+ind_x+1]; else    d11 = make_float2(0.f, 0.f);

       result_temp1.x = a * d10.x + (-d00.x * a + d00.x); 
       result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

       result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
       result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

       d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
       d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

   } 
}

CUDA implementation with texture fetch

This is the same implementation as above, but global memory is now accessed through the texture cache. For example, T[i,j] is accessed as

tex2D(d_texture_fetch_float,ind_x,ind_y);

(where, of course, ind_x = i and ind_y = j, and d_texture_fetch_float is assumed to be a global-scope texture reference) instead of

d_data[ind_y*M1+ind_x];

Note that the hard-wired texture filtering capabilities are not exploited here. The routine below has the same precision as the one above and could turn out somewhat faster on old CUDA architectures.

__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result, 
                                                                const float * __restrict__ d_xout, const float * __restrict__ d_yout, 
                                                                const int M1, const int M2, const int N1, const int N2)
{
   const int l = threadIdx.x + blockDim.x * blockIdx.x;
   const int k = threadIdx.y + blockDim.y * blockIdx.y;

   if ((l<N1)&&(k<N2)) {

       float2 result_temp1, result_temp2;

       const int    ind_x = floor(d_xout[k*N1+l]); 
       const float  a     = d_xout[k*N1+l]-ind_x; 

       const int    ind_y = floor(d_yout[k*N1+l]); 
       const float  b     = d_yout[k*N1+l]-ind_y; 

       const float2 d00  = tex2D(d_texture_fetch_float,ind_x,ind_y);    
       const float2 d10  = tex2D(d_texture_fetch_float,ind_x+1,ind_y);
       const float2 d11  = tex2D(d_texture_fetch_float,ind_x+1,ind_y+1);
       const float2 d01  = tex2D(d_texture_fetch_float,ind_x,ind_y+1);

       result_temp1.x = a * d10.x + (-d00.x * a + d00.x); 
       result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

       result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
       result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

       d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
       d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

   } 
}

Texture binding can be done according to

void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch; 
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d,&pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0,&d_texture_fetch_float,data_d,&desc,M1,M2,pitch));
    d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
    gpuErrchk(cudaMemcpy2D(data_d,pitch,data,sizeof(float2)*M1,sizeof(float2)*M1,M2,cudaMemcpyHostToDevice));

}

Note that no if/else boundary checking is needed here: the texture automatically clamps out-of-range coordinates to the [0, M1-1] x [0, M2-1] sampling region (returning the nearest edge sample rather than zero, which differs slightly from the explicit zero-padding of the previous versions), thanks to the instructions

    d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;

CUDA implementation with texture interpolation

This is the last implementation and uses the hard-wired capabilities of texture filtering.

__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result, 
                                                                 const float * __restrict__ d_xout, const float * __restrict__ d_yout, 
                                                                 const int M1, const int M2, const int N1, const int N2)
{
   const int l = threadIdx.x + blockDim.x * blockIdx.x;
   const int k = threadIdx.y + blockDim.y * blockIdx.y;

   if ((l<N1)&&(k<N2)) { d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f); }
}

Note that the interpolation formula implemented by this feature is the same as derived above, but now

i = floor(x_B)    a = x_B - i    j = floor(y_B)    b = y_B - j

where x_B = x - 0.5 and y_B = y - 0.5. This explains the 0.5 offset in the instruction

tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f)

In this case, texture binding should be done as follows

void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch; 
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d,&pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0,&d_texture_interp_float,data_d,&desc,M1,M2,pitch));
    d_texture_interp_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_interp_float.addressMode[1] = cudaAddressModeClamp;
    d_texture_interp_float.filterMode = cudaFilterModeLinear;   // --- Enable linear filtering
    d_texture_interp_float.normalized = false;                  // --- Texture coordinates will NOT be normalized
    gpuErrchk(cudaMemcpy2D(data_d,pitch,data,sizeof(float2)*M1,sizeof(float2)*M1,M2,cudaMemcpyHostToDevice));

}
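
For completeness, here is a hypothetical host-side sketch tying the pieces together; it assumes d_result, d_xout and d_yout have already been allocated on the device and that gpuErrchk is the usual error-checking macro used above.

void run_texture_interpolation(float2 * __restrict__ d_result,
                               const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                               const float2 * __restrict__ h_data,
                               const int M1, const int M2, const int N1, const int N2)
{
    // --- Copy the input samples to a pitched allocation and bind the texture
    TextureBindingBilinearInterp(h_data, M1, M2);

    // --- One thread per output point
    dim3 block(16, 16);
    dim3 grid((N1 + block.x - 1) / block.x, (N2 + block.y - 1) / block.y);
    bilinear_interpolation_kernel_GPU_texture_interp<<<grid, block>>>(d_result, d_xout, d_yout,
                                                                      M1, M2, N1, N2);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
}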

Note that, as already mentioned in the other answer, a and b are stored in 9-bit fixed-point format with 8 bits of fractional value, so this approach will be very fast, but less accurate than the ones above.

answered Sep 18 '22 by Vitality


The UV interpolants are truncated to 9 bits, not the participating texel values. In Chapter 10 (Texturing) of The CUDA Handbook, this is described in detail (including CPU emulation code) for the 1D case. Code is open source and may be found at https://github.com/ArchaeaSoftware/cudahandbook/blob/master/texturing/tex1d_9bit.cu
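
For illustration, a rough CPU sketch of the 1D case just described (the clamp handling and the exact point at which the truncation happens are simplifying assumptions here; see tex1d_9bit.cu at the link above for the reference emulation):

#include <math.h>

// Emulate 1D linear texture filtering with the interpolation weight truncated
// to 9-bit fixed point (8 fractional bits), clamp addressing, unnormalized coordinates.
float tex1D_linear_emulated(const float *T, int N, float x)
{
    float xB = x - 0.5f;                                // undo the texel-center offset
    if (xB < 0.0f)           xB = 0.0f;                 // cudaAddressModeClamp
    if (xB > (float)(N - 1)) xB = (float)(N - 1);
    const int   i      = (int)floorf(xB);
    const float alpha  = xB - (float)i;
    const float alphaQ = (float)((int)(alpha * 256.0f)) / 256.0f;  // truncate to 1/256 steps
    const int   i1     = (i + 1 < N) ? (i + 1) : (N - 1);          // clamp the upper neighbor
    return (1.0f - alphaQ) * T[i] + alphaQ * T[i1];
}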

answered Sep 19 '22 by ArchaeaSoftware