C vs assembler vs NEON performance

I am working on an iPhone application that does real-time image processing. One of the earliest steps in its pipeline is to convert a BGRA image to greyscale. I tried several different methods, and the difference in timing results is far greater than I had imagined possible. First I tried using C. I approximate the conversion to luminosity as (B + 2*G + R) / 4, i.e. weights of 0.25, 0.5 and 0.25 for the blue, green and red channels:

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();

uchar *pOut = imByte.data;
for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{
    unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
    pOut[0] = sumA / 4;
    unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
    pOut[1] = sumB / 4;
    unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
    pOut[2] = sumC / 4;
    unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
    pOut[3] = sumD / 4;
    pOut +=4;
}       
}

This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing:

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();

unsigned int *pOut = (unsigned int*) imByte.data;

for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{
  register unsigned int nBGRA1 asm("r4");
  register unsigned int nBGRA2 asm("r5");
  unsigned int nZero=0;
  unsigned int nSum1;
  unsigned int nSum2;
  unsigned int nPacked1;
  asm volatile(
           
               "ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0]       \n"   // Load in two BGRA words
               "usad8 %[nSum1], %[nBGRA1], %[nZero]  \n"  // Add R+G+B+A 
               "usad8 %[nSum2], %[nBGRA2], %[nZero]  \n"  // Add R+G+B+A 
               "uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8    \n"   // Add G again
               "uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8    \n"   // Add G again
               "mov %[nPacked1], %[nSum1], LSR #2 \n"    // Init packed word   
               "mov %[nSum2], %[nSum2], LSR #2 \n"   // Div by four
               "add %[nPacked1], %[nPacked1], %[nSum2], LSL #8 \n"   // Add to packed word                 

               "ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8]       \n"   // Load in two more BGRA words
               "usad8 %[nSum1], %[nBGRA1], %[nZero]  \n"  // Add R+G+B+A 
               "usad8 %[nSum2], %[nBGRA2], %[nZero]  \n"  // Add R+G+B+A 
               "uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8    \n"   // Add G again
               "uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8    \n"   // Add G again
               "mov %[nSum1], %[nSum1], LSR #2 \n"   // Div by four
               "add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n"   // Add to packed word
               "mov %[nSum2], %[nSum2], LSR #2 \n"   // Div by four
               "add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n"   // Add to packed word                 
              
               ///////////
               ////////////
               
               : [pIn]"+r" (pIn), 
         [nBGRA1]"+r"(nBGRA1),
         [nBGRA2]"+r"(nBGRA2),
         [nZero]"+r"(nZero),
         [nSum1]"+r"(nSum1),
         [nSum2]"+r"(nSum2),
         [nPacked1]"+r"(nPacked1)
               :
               : "cc"  );
  *pOut = nPacked1;
  pOut++;
 }
 }

This function converts the same image in 12 ms, almost 5X faster! I had not programmed in assembler before, but I assumed that it would not be this much faster than C for such a simple operation. Inspired by this success, I continued searching and discovered a NEON conversion example here.

void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
__asm__ volatile("lsr          %2, %2, #3      \n"
                 "# build the three constants: \n"
                 "mov         r4, #28          \n" // Blue channel multiplier
                 "mov         r5, #151         \n" // Green channel multiplier
                 "mov         r6, #77          \n" // Red channel multiplier
                 "vdup.8      d4, r4           \n"
                 "vdup.8      d5, r5           \n"
                 "vdup.8      d6, r6           \n"
                 "0:                           \n"
                 "# load 8 pixels:             \n"
                 "vld4.8      {d0-d3}, [%1]!   \n"
                 "# do the weight average:     \n"
                 "vmull.u8    q7, d0, d4       \n"
                 "vmlal.u8    q7, d1, d5       \n"
                 "vmlal.u8    q7, d2, d6       \n"
                 "# shift and store:           \n"
                 "vshrn.u16   d7, q7, #8       \n" // Divide q3 by 256 and store in the d7
                 "vst1.8      {d7}, [%0]!      \n"
                 "subs        %2, %2, #1       \n" // Decrement iteration count
                 "bne         0b            \n" // Repeat unil iteration count is not zero
                 :
                 :  "r"(output_data),           
                 "r"(input_data),           
                 "r"(tot_pixels)        
                 : "r4", "r5", "r6"
                 );
}
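
For what it's worth, here is a minimal scalar equivalent of what that NEON loop computes per pixel, assuming the vld4.8 deinterleave puts blue in d0, green in d1 and red in d2 for BGRA input (my reading of the code; the helper below is mine, not part of the original example):

static inline unsigned char grey_from_bgra(unsigned char b, unsigned char g, unsigned char r)
{
    // 28, 151 and 77 are roughly the Rec. 601 weights 0.114, 0.587 and 0.299 scaled by 256;
    // the shift by 8 corresponds to the vshrn.u16 #8 narrowing shift above.
    return (unsigned char)((28u * b + 151u * g + 77u * r) >> 8);
}

Note that these weights differ slightly from the (B + 2*G + R) / 4 approximation used in the C version, so the two routines do not produce bit-identical output.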

The timing results were hard to believe. It converts the same image in 1 ms: 12X faster than the assembler version and an astounding 55X faster than C. I had no idea that such performance gains were possible. In light of this I have a few questions. First off, am I doing something terribly wrong in the C code? I still find it hard to believe that it is so slow. Second, if these results are at all accurate, in what kinds of situations can I expect to see these gains? You can probably imagine how excited I am at the prospect of making other parts of my pipeline run 55X faster. Should I be learning assembler/NEON and using them inside any loop that takes an appreciable amount of time?

Update 1: I have posted the assembler output from my C function in a text file at http://temp-share.com/show/f3Yg87jQn (it was far too large to include directly here).

Timing is done using OpenCV functions.

double duration = static_cast<double>(cv::getTickCount());
//function call
duration = static_cast<double>(cv::getTickCount()) - duration;
duration /= cv::getTickFrequency();
//duration is now the elapsed time in seconds (cv::getTickFrequency() returns ticks per second); multiply by 1000 for milliseconds

Results

I tested several of the suggested improvements. First, as recommended by Viktor, I reordered the inner loop to put all of the fetches first. The inner loop then looked like this:

for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{     
  //Jul 16, 2012 MR: Read and writes collected
  sumA = pIn[0] + 2 * pIn[1] + pIn[2];
  sumB = pIn[4] + 2 * pIn[5] + pIn[6];
  sumC = pIn[8] + 2 * pIn[9] + pIn[10];
  sumD = pIn[12] + 2 * pIn[13] + pIn[14];
  pOut +=4;
  pOut[0] = sumA / 4;
  pOut[1] = sumB / 4;
  pOut[2] = sumC / 4;
  pOut[3] = sumD / 4;
}

This change brought processing time down to 53 ms, an improvement of 2 ms. Next, as recommended by Viktor, I changed my function to fetch the data as unsigned ints. The inner loop then looked like this:

unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* out = temp.data;

for(; in_int < end; in_int+=4)   // Does four pixels at a time
{
    unsigned int pixelA = in_int[0];
    unsigned int pixelB = in_int[1];
    unsigned int pixelC = in_int[2];
    unsigned int pixelD = in_int[3];
        
    uchar* byteA = (uchar*)&pixelA;
    uchar* byteB = (uchar*)&pixelB;
    uchar* byteC = (uchar*)&pixelC;
    uchar* byteD = (uchar*)&pixelD;         
        
    unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
    unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
    unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
    unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];

    out[0] = sumA / 4;
    out[1] = sumB / 4;
    out[2] = sumC / 4;
    out[3] = sumD / 4;
    out +=4;
}

This modification had a dramatic effect, dropping processing time to 14 ms, a drop of 39 ms (75%). This last result is very close to the assembler performance of 11 ms. The final optimization, as recommended by rob, was to add the __restrict keyword to every pointer declaration, changing the following lines:

unsigned int* __restrict in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* __restrict out = temp.data;
...
uchar* __restrict byteA = (uchar*)&pixelA;
uchar* __restrict byteB = (uchar*)&pixelB;
uchar* __restrict byteC = (uchar*)&pixelC;
uchar* __restrict byteD = (uchar*)&pixelD;
...     

These changes had no measurable effect on processing time. Thank you for all your help; I will be paying much closer attention to memory management in the future.

Asked Jul 16 '12 by Hammer

3 Answers

There is an explanation here concerning some of the reasons for NEON's "success": http://hilbert-space.de/?p=22

Try compiling your C code with the "-S -O3" switches to see the optimized assembly output of the GCC compiler.

IMHO, the key to success is the optimized read/write pattern employed by both assembly versions. And NEON/MMX/other vector engines also support saturation (clamping results to 0..255 without having to use the 'unsigned ints').

See these lines in the loop:

unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;

The reads and writes are really mixed. A slightly better version of the loop body would be:

// and the pIn reads can be combined into a single 4-byte fetch per pixel
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
pOut += 4;

Keep in mind that the "unsigned int sumA" line here can really mean an alloca() call (allocation on the stack), so you may be wasting cycles on the temporary variable allocations (four of them per iteration).

Also, the pIn[i] indexing does only a single-byte fetch from memory. The better way is to read the whole 32-bit word for each pixel through an "unsigned int*" and then extract the individual bytes from it, as sketched below.
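
For illustration, a minimal sketch of that idea (untested; the variable names are mine, and it assumes little-endian BGRA data so the low byte of each 32-bit word is blue, and that imBGRA.MemSize() is a byte count, as the question's loop bound suggests):

unsigned int *pWord = (unsigned int *) imBGRA.data;   // same buffer as pIn, viewed as 32-bit words
uchar *pGrey = imByte.data;
unsigned int nPixels = imBGRA.MemSize() / 4;          // 4 bytes per BGRA pixel

for (unsigned int i = 0; i < nPixels; ++i)
{
    unsigned int px = pWord[i];                       // one 4-byte fetch loads a whole pixel
    unsigned int b  =  px        & 0xFF;
    unsigned int g  = (px >> 8)  & 0xFF;
    unsigned int r  = (px >> 16) & 0xFF;
    pGrey[i] = (uchar)((b + 2u * g + r) / 4u);        // same weights as the question's C code
}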

The NEON version is clearly superior: the lines

             "# load 8 pixels:             \n"
             "vld4.8      {d0-d3}, [%1]!   \n"

and

             "#save everything in one shot   \n"
             "vst1.8      {d7}, [%0]!      \n"

save most of the time spent on memory access.

Answered by Viktor Latypov

If performance is critically important (as it generally is with real-time image processing), you do need to pay attention to the machine code. As you have discovered, it can be especially important to use the vector instructions (which are designed for things like real-time image processing) -- and it is hard for compilers to automatically use the vector instructions effectively.

What you should try, before committing to assembly, is using compiler intrinsics. Compiler intrinsics aren't any more portable than assembly, but they should be easier to read and write, and easier for the compiler to work with. Aside from maintainability problems, the performance problem with assembly is that it effectively turns off the optimizer (you did use the appropriate compiler flag to turn it on, right?). That is: with inline assembly, the compiler is not able to tweak register assignment and so forth, so if you don't write your entire inner loop in assembly, it may still not be as efficient as it could be.
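
For example, here is a rough sketch of the NEON routine from the question written with intrinsics instead of inline assembly (untested; the function name is mine, and like the assembly version it assumes tot_pixels is a multiple of 8):

#include <arm_neon.h>

void greyScaleNEONIntrinsics(uint8_t* output_data, const uint8_t* input_data, int tot_pixels)
{
    // Same channel weights as the assembly version: Rec. 601 coefficients scaled by 256
    const uint8x8_t wB = vdup_n_u8(28);
    const uint8x8_t wG = vdup_n_u8(151);
    const uint8x8_t wR = vdup_n_u8(77);

    for (int i = 0; i < tot_pixels; i += 8)
    {
        uint8x8x4_t bgra = vld4_u8(input_data + 4 * i);  // deinterleave 8 BGRA pixels
        uint16x8_t  sum  = vmull_u8(bgra.val[0], wB);    // B * 28
        sum = vmlal_u8(sum, bgra.val[1], wG);            // + G * 151
        sum = vmlal_u8(sum, bgra.val[2], wR);            // + R * 77
        vst1_u8(output_data + i, vshrn_n_u16(sum, 8));   // divide by 256, narrow, store 8 greys
    }
}

The compiler can then handle register allocation and instruction scheduling itself, and can still optimize around the call site.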

However, you will still be able to use your newfound assembly expertise to good effect -- as you can now inspect the assembly produced by your compiler, and figure out if it's being stupid. If so, you can tweak the C code (perhaps doing some pipelining by hand if the compiler isn't managing to), recompile it, look at the assembly output to see if the compiler is now doing what you want it to, then benchmark to see if it's actually running any faster...

If you've tried the above, and still can't provoke the compiler to do the right thing, go ahead and write your inner loop in assembly (and, again, check to see if the result is actually faster). For reasons described above, be sure to get the entire inner loop, including the loop branch.

Finally, as others have mentioned, take some time to try and figure out what "the right thing" is. Another benefit of learning your machine architecture is that it gives you a mental model of how things work -- so you will have a better chance of understanding how to put together efficient code.

Answered by comingstorm


Viktor Latypov's answer has lots of good information, but I want to point out one more thing: in your original C function, the compiler can't tell that pIn and pOut point to non-overlapping regions of memory. Now look at these lines:

pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];

The compiler has to assume that pOut[0] might be the same as pIn[4] or pIn[5] or pIn[6] (or any other pIn[x]). So it basically can't reorder any of the code in your loop.

You can tell the compiler that pIn and pOut don't overlap by declaring them __restrict:

uchar * __restrict pIn = (uchar*) imBGRA.data;
uchar * __restrict pOut = imByte.data;

This might speed up your original C version a bit.

Answered by rob mayoff