Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to vectorize a 3x3 2D convolution?

I'm trying to write a optimized 3x3 2D image convolution for a 1280x720 image.

For simplicity, edge condition is approached by padding the input to 1284*724.

Here's my kernel code:

__kernel 
    __attribute__((vec_type_hint(float4)))
    void conv2d3x3(                                                     
       __global const float* restrict input,                         
       __global float* restrict output,
       __constant const float4* restrict hL, 
       /* 3x3 kernel, padded with 3 zeros on the right, used to calculate
        "left" output samples*/
       __constant const float4* restrict hR
       /*same 3x3 kernel, padded with 3 samples on the left*/)           
    {                                        
       int j = get_global_id(0)*2;  //[0,639]  
       int i = get_global_id(1)*2;  //[0,359]                

       /* load a 4x4 block, note stride is 1284 because input is padded by 4*/   
       float4 data0=vload4(0,input+1284*(i+0)+j);
       float4 data1=vload4(0,input+1284*(i+1)+j);
       float4 data2=vload4(0,input+1284*(i+2)+j);
       float4 data3=vload4(0,input+1284*(i+3)+j);

       /* sum(data[0:2,0:2].* h)*/
       float prodTL=dot(data0,hL[0])+dot(data1,hL[1])+dot(data2,hL[2]);
       /* sum(data[0:2,1:3].* h)*/
       float prodTR=dot(data0,hR[0])+dot(data1,hR[1])+dot(data2,hR[2]);
       /* sum(data[1:3,0:2].* h)*/
       float prodBL=dot(data1,hL[0])+dot(data2,hL[1])+dot(data3,hL[2]);
       /* sum(data[1:3,1:3].* h)*/
       float prodBR=dot(data1,hR[0])+dot(data2,hR[1])+dot(data3,hR[2]);

       output[1280*(i+0)+j]=prodTL;
       output[1280*(i+0)+j+1]=prodTR;
       output[1280*(i+1)+j]=prodBL;
       output[1280*(i+1)+j+1]=prodBR;
    } 

The rational of this design is, to load a 4x4 block of data, do four 3x4 convolutions and generate 4 output samples.

This code has a few obvious problems:

1) vector load is not aligned to vector boundary.

2) the storing of output is not vectorized

3) performance is poor: 3ms on Intel XEON 1245v3 with P4600(with Beignet OpenCL implenentation) and 27ms on Freescale IMX6Q with GC2000 (with Freescale OpenCL libOpenCL).

Question:

1) what I did wrong and why it's so slow?

2) what kind of performance shall I expect in terms of percentage of raw FLOPS? (p4600 is capable of 20EU * 2PFU/EU * SIMD8 = 320FLOPS/cycle running between 350MHz and 1.2GHz, while GC2000 is capable of at least 14GFLOPS)

3) in general, how to vectorize a fixed size non-separable 2D convolution without generating excessive memory traffic and cache conflict?

like image 623
user3528438 Avatar asked Jun 05 '26 01:06

user3528438


1 Answers

First, my unoptimized result:

Amd FX 8150 @3.3 GHz(32 fp elements => 1 add + 1 mul = 64 FLOPS per cycle):

3.71ms including copying time between separate opencl buffers and C# arrays.

2.05ms not including array copies.

Using 1-D ndrange kernel execution instead of 2D. [0,640x360]

__kernel 
    __attribute__((vec_type_hint(float4)))
    void bench(                                                     
       __global const float* restrict input,                         
       __global float* restrict output,
       __constant const float4* restrict hL, 
       
       __constant const float4* restrict hR
      )           
    {
                int gli=get_global_id(0); 
                int j = (gli%640) * 2 ;
                int i = (gli/640) * 2; 

                /* load a 4x4 block*/
                float4 data0 = vload4(0, input + 1280 * (i + 0) + j);
                float4 data1 = vload4(0, input + 1280 * (i + 1) + j);
                float4 data2 = vload4(0, input + 1280 * (i + 2) + j);
                float4 data3 = vload4(0, input + 1280 * (i + 3) + j);

               
                float prodTL = dot(data0, hL[0]) + dot(data1, hL[1]) + dot(data2, hL[2]);
                
                float prodTR = dot(data0, hR[0]) + dot(data1, hR[1]) + dot(data2, hR[2]);
               
                float prodBL = dot(data1, hL[0]) + dot(data2, hL[1]) + dot(data3, hL[2]);
                
                float prodBR = dot(data1, hR[0]) + dot(data2, hR[1]) + dot(data3, hR[2]);

                output[1280 * (i + 0) + j] = prodTL;
                output[1280 * (i + 0) + j + 1] = prodTR;
                output[1280 * (i + 1) + j] = prodBL;
                output[1280 * (i + 1) + j + 1] = prodBR;
            }

Host side (C# arrays):

        float[] inp = new float[1280*720*2];
        float[] outp = new float[1280*720*2];
        float[] hL = new float[1024];
        float[] hR = new float[1024];

With prefetching into private registers(I can just hope drivers using cpu registers):

2ms

Optimized part:

        float4 hl2=hL[2];
        float4 hl1=hL[1];
        float4 hl0=hL[0];

        float4 hr2=hR[2];
        float4 hr1=hR[1];
        float4 hr0=hR[0];

        float prodTL = dot(data0, hl0) + dot(data1, hl1) + dot(data2, hl2);
        
        float prodTR = dot(data0, hr0) + dot(data1, hr1) + dot(data2,  hr2);
       
        float prodBL = dot(data1, hl0) + dot(data2, hl1) + dot(data3, hl2);
        
        float prodBR = dot(data1, hr0) + dot(data2, hr1) + dot(data3,  hr2);

Now with increased parallelism on dot products:

sum of three dots will be equal to one large dot.

  float16 prodhl    =(float16)(hl0,  hl1,  hl2,  (float4)(0.0f,0.0f,0.0f,0.0f));                    
                float16 prodhl    =(float16)(hr0,  hr1,  hr2,  (float4)(0.0f,0.0f,0.0f,0.0f));   
                float16 prodTdata =(float16)(data0,data1,data2,(float4)(0.0f,0.0f,0.0f,0.0f));    

                float16 prodBdata=(float16)(data1,data2,data3,(float4)(0.0f,0.0f,0.0f,0.0f));    

                float prodTL = dot(prodTdata, prodhl);
                float prodTR = dot(prodTdata, prodhr);

                float prodBL = dot(prodBdata, prodhl);
                float prodBR = dot(prodBdata, prodhr);

Execution without any array copies:

0.5412 ms

meybe its just the AVX capability of the cpu. If not, then there should be some instruction level parallelism happening.

There is 1/4 wasted compute power in this part(the latest float4 section of float16) so there must be a way to reach 0.4 ms .

Note: thread group size was 256. I didnt try increasing to 1024 since its not suitable for all devices such as amd gpu.

You could try task process level parallelism to increase throughput and beat single opencl context (if you already doing this).

like image 150
huseyin tugrul buyukisik Avatar answered Jun 07 '26 22:06

huseyin tugrul buyukisik