The stdint.h header lacks int_fastest_t and uint_fastest_t types to correspond with the {,u}int_fastX_t types. For instances where the width of the integer type does not matter, how does one pick the integer type that allows processing the greatest quantity of bits with the least penalty to performance? For example, if one were searching for the first set bit in a buffer using a naive approach, a loop such as this might be considered:
// return the bit offset of the first 1 bit
size_t find_first_bit_set(void const *const buf)
{
    uint_fastest_t const *p = buf; // use the fastest type for comparison to zero
    for (; *p == 0; ++p)           // advance p while no bits are set
        ;
    // bytes skipped, times bits per byte, plus the (1-based) ffsX result
    return ((char const *)p - (char const *)buf) * CHAR_BIT + ffsX(*p) - 1;
}
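Since uint_fastest_t does not actually exist, here is a compilable sketch of the same routine with a concrete type substituted; unsigned long is my arbitrary pick here, and the hypothetical ffsX is replaced with a portable shift loop. It assumes the buffer is suitably aligned and contains at least one set bit:

#include <limits.h>
#include <stddef.h>

typedef unsigned long word_t; // concrete stand-in for uint_fastest_t

size_t find_first_bit_set_ul(void const *const buf)
{
    word_t const *p = buf;            // assumes buf is aligned for word_t
    for (; *p == 0; ++p)              // skip all-zero words
        ;
    size_t bit = 0;
    for (word_t w = *p; (w & 1) == 0; w >>= 1) // index of lowest set bit
        ++bit;
    return ((char const *)p - (char const *)buf) * CHAR_BIT + bit;
}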
Naturally, using char would result in more operations than int. But long long might result in more expensive operations than int on a 32-bit system, outweighing the iterations it saves, and so on.
My current assumption is that for the mainstream architectures, long is the safest bet: it's 32 bits on 32-bit systems, and 64 bits on 64-bit systems.
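One way to encode that assumption at compile time (purely a sketch: using UINTPTR_MAX as a proxy for the native register width is my own guess, not anything stdint.h promises, and uintptr_t itself is an optional type) is:

#include <stdint.h>

// Guess the "fastest" word type from the pointer width, on the
// assumption that pointers match the native register size.
#if UINTPTR_MAX == 0xFFFFFFFFFFFFFFFFu
typedef uint64_t uint_fastest_t;
#elif UINTPTR_MAX == 0xFFFFFFFFu
typedef uint32_t uint_fastest_t;
#else
typedef unsigned long uint_fastest_t; // fall back to the long heuristic
#endif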
In most cases using int in a loop is more efficient than using short. My simple tests showed a performance gain of ~10% when using int.
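For what it's worth, a minimal sketch of the kind of timing loop behind a number like that (the array size, the fill pattern, and the use of clock() are all arbitrary choices, and an optimizer may reshape both loops):

#include <stdio.h>
#include <time.h>

#define N 10000000

static int   a_int[N];
static short a_short[N];

int main(void)
{
    for (int i = 0; i < N; ++i) {      // fill with non-constant data
        a_int[i]   = i;
        a_short[i] = (short)i;
    }

    long long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; ++i)
        sum += a_int[i];
    clock_t t1 = clock();
    for (int i = 0; i < N; ++i)
        sum += a_short[i];
    clock_t t2 = clock();

    printf("int:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("short: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return (int)(sum & 1);             // use sum so it isn't optimized away
}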
The size of a signed int or unsigned int item is the natural integer size on a particular platform. For example, on 16-bit platforms the int type is usually 16 bits, or 2 bytes. On 32-bit platforms, the int type is usually 32 bits, or 4 bytes.
uint_fast8_t is the fastest unsigned integer type with a width of at least 8 bits. uint_least8_t is the smallest unsigned integer type with a width of at least 8 bits.
int is commonly 32 bits wide on modern platforms, but that is not guaranteed. sizeof(T) gives the number of bytes needed to store a variable of type T, where a byte is CHAR_BIT bits; CHAR_BIT is at least 8 and is exactly 8 (an octet) on mainstream machines.
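All of these properties are easy to probe on a given implementation; a small check (the output varies by platform):

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // None of these values are fixed by the standard beyond the minimums.
    printf("CHAR_BIT      = %d\n", CHAR_BIT);
    printf("sizeof(int)   = %zu\n", sizeof(int));
    printf("uint_fast8_t  = %zu byte(s)\n", sizeof(uint_fast8_t));
    printf("uint_least8_t = %zu byte(s)\n", sizeof(uint_least8_t));
    return 0;
}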
int_fast8_t is always the fastest integer type in a correct implementation. There can never be integer types smaller than 8 bits (because CHAR_BIT >= 8 is required), and since int_fast8_t is the fastest integer type with at least 8 bits, it's thus the fastest integer type, period.
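Note, though, that "fastest per operation" is not the same as "most bits processed per iteration", which is what the buffer scan in the question cares about. On glibc/x86-64, for instance, uint_fast8_t is a 1-byte type, so a scan typed with it still walks the buffer one octet at a time; a sketch for illustration:

#include <stddef.h>
#include <stdint.h>

// Each individual comparison uses the "fastest" 8-bit-capable type,
// yet only CHAR_BIT bits of the buffer are examined per iteration.
size_t leading_zero_bytes(uint_fast8_t const *p, size_t n)
{
    size_t i = 0;
    while (i < n && p[i] == 0)
        ++i;
    return i;
}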
Theoretically, int is the best bet. It should map to the CPU's native register size, and thus be "optimal" in the sense you're asking about.
However, you may still find that a 64-bit or 128-bit integer is faster on some CPUs than a 32-bit one, because although these are larger than the register size, they reduce the number of iterations of your loop, and may therefore work out more efficient by minimising the loop overhead and/or taking advantage of multi-word loads/stores to move the data faster.
For example, on ARM-2 processors it took 4 memory cycles to load one 32-bit register, but only 5 cycles to load two sequentially, and 7 cycles to load four sequentially. The routine you suggest above would be optimised to use as many registers as you could free up (8 to 10, usually), and could therefore run up to 3 or 4 times faster by using multiple registers per loop iteration.
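In C terms that corresponds to unrolling the scan to touch several words per iteration, which lets the compiler batch the loads; a sketch, assuming the buffer length is a multiple of four words and at least one word is non-zero:

#include <stddef.h>

typedef unsigned long word_t; // stand-in word type, as earlier

// Scan four words per iteration; ORing them lets a single branch
// cover the whole group, mimicking the multi-register load trick.
size_t find_first_nonzero_word_x4(word_t const *p)
{
    word_t const *const base = p;
    while ((p[0] | p[1] | p[2] | p[3]) == 0)
        p += 4;
    while (*p == 0)  // locate the exact word within the group
        ++p;
    return (size_t)(p - base);
}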
The only way to be sure is to write several routines and then profile them on the specific target machine to find out which produces the best performance.
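A sketch of that profiling approach, timing the same zero-skipping scan at several word widths (the buffer size, repetition count, and reliance on clock() are all placeholders to adapt to your target):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (1u << 20)

static uint64_t buf[BUF_BYTES / sizeof(uint64_t)]; // 64-bit aligned storage

// Generic word-at-a-time scan, instantiated per candidate type.
#define DEFINE_SCAN(name, T)                                  \
    static size_t name(void)                                  \
    {                                                         \
        T const *p = (T const *)buf;                          \
        while (*p == 0)                                       \
            ++p;                                              \
        return (size_t)((unsigned char const *)p              \
                        - (unsigned char const *)buf);        \
    }

DEFINE_SCAN(scan8,  uint8_t)
DEFINE_SCAN(scan32, uint32_t)
DEFINE_SCAN(scan64, uint64_t)

static void bench(char const *label, size_t (*scan)(void))
{
    clock_t t0 = clock();
    size_t off = 0;
    for (int i = 0; i < 100; ++i)  // repeat to get measurable times
        off = scan();
    clock_t t1 = clock();
    printf("%-4s offset=%zu  %.3f s\n", label, off,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
}

int main(void)
{
    memset(buf, 0, sizeof buf);
    ((unsigned char *)buf)[BUF_BYTES - 1] = 1; // one set bit near the end
    bench("u8",  scan8);
    bench("u32", scan32);
    bench("u64", scan64);
    return 0;
}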