New posts in micro-optimization

Can the compiler/JIT optimize away short-circuit evaluation if there are no side-effects?

Understanding a specific CIL / CLR optimization

Fastest way to take the average of two signed integers in x86 assembly?

Why do C compilers still prefer push over mov for saving registers, even when mov appears faster in llvm-mca?

Is the fall-through side of a conditional branch more efficient? Is it a good idea to make that the error-handling side?

Efficient UTF-8 character-length decoding for a non-zero character in a 32 bit register

Advantage of using LEA over MOV for passing parameters in Assembly compiled from C++

Is there a faster algorithm for max(ctz(x), ctz(y))?

repz ret: why all the hassle?

Why does _umul128 work slower than scalar code for the mul128x64x2 function?

How to implement the totalOrder predicate for floating point values?

ARM Cortex M0+: How to use "Branch if Carry" instructions in C-code?

Can array access be optimized?

x86 Assembly pushad/popad: how fast is it?

Optimize nested loops for pattern-filling an array, to help the compiler produce efficient ARM assembly?

Why is there an unnecessary MOVAPD copy in gcc 9.1, in a tiny function?

x86 opcode alignment references and guidelines

How to get lg2 of a number that is 2^k?

Why is my operator ++ more than twice as fast as its equivalent instance method?

_mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps?