
New posts in micro-optimization

Missing latency in instruction tables

What is the fastest way to swap the bytes of an unaligned 64 bit value in memory?

Why is clang's `-O3` alloca 2x faster than g++

Is there any data on the latency of an AVX2 gather instruction?

Why don't cases in switch statements have their own scope?

Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"?

Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?

Does adding local variables to methods make them slower?

What nonlocal effects can change the performance of a basic block?

Why jnz requires 2 cycles to complete in an inner loop

Loading an entire cache line at once to avoid contention for multiple elements of it

Does it cost significant resources for a modern CPU to keep flags updated?

How to write a custom exception class derived from std::invalid_argument?

What is the difference between _mm_movehdup_ps and _mm_shuffle_ps in this case?

How to properly increment some array key, even if key needs to be created?

Shortest Intel x86-64 opcode for rax=1?

Optimize C# Code Fragment

How to optimize DivMod for a constant divisor of 10

Efficient modulo-255 computation

Efficient computation of the average of three unsigned integers (without overflow)