Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Leading zeros calculation with intrinsic function

I'm trying to optimize some code working in an embedded system (FLAC decoding, Windows CE, ARM 926 MCU).

The default implementation uses a macro and a lookup table:

/* counts the # of zero MSBs in a word */
#define COUNT_ZERO_MSBS(word) ( \
 (word) <= 0xffff ? \
  ( (word) <= 0xff? byte_to_unary_table[word] + 24 : \
              byte_to_unary_table[(word) >> 8] + 16 ) : \
  ( (word) <= 0xffffff? byte_to_unary_table[word >> 16] + 8 : \
              byte_to_unary_table[(word) >> 24] ) \
)

static const unsigned char byte_to_unary_table[] = {
    8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

However most CPU already have a dedicated instruction, bsr on x86 and clz on ARM (http://www.devmaster.net/articles/fixed-point-optimizations/), that should be more efficient.

On Windows CE we have the intrinsic function _CountLeadingZeros, that should just call that value. However it is 4 times slower than the macro (measured on 10 million of iterations).

How is possible that an intrinsic function, that (should) rely on a dedicated ASM instruction, is 4 times slower?

like image 512
lornova Avatar asked Oct 28 '10 16:10

lornova


1 Answers

Check the disassembly. Are you sure that the compiler inserted the instruction? In the Remarks section there is this text:

This function can be implemented by calling a runtime function.

I suspect that's what's happening in your case.

Note that the CLZ instruction is only available in ARMv5 and later. You need to tell the compiler if you want ARMv5 code:

/QRarch5 ARM5 Architecture
/QRarch5T ARM5T Architecture

(Microsoft incorrectly uses "ARM5" instead of "ARMv5")

like image 91
Igor Skochinsky Avatar answered Nov 11 '22 22:11

Igor Skochinsky