I've developed a small benchmark. The result of the loop inside the benchmark should be turned into zero, and the next round of my calculation should depend on that zero "result" of the previous loop, so that I measure the latency of the code rather than its throughput. MOVing the result to another register and XORing that register with itself doesn't work, because today's CPUs recognize that an XOR of a register with itself doesn't depend on the preceding instructions. So I tried subtracting the register from itself, hoping that the CPU (Ryzen Threadripper 3990X) has no such shortcut as it has for XOR. I evaluated this with a separate program:
#include <iostream>
#include <chrono>
#include <cstddef>   // size_t
#include <cstdint>   // int64_t
using namespace std;
using namespace chrono;
int main()
{
    auto start = high_resolution_clock::now();
    for( size_t i = 1'000'000'000; i--; )
        __asm
        {
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
            sub     eax, eax
        }
    // 1e9 iterations x 10 instructions = 1e10 instructions, so ns is the time per instruction
    double ns = (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 10'000'000'000.0;
    cout << ns << endl;
}
"Unfortunately" the CPU takes a shortcut here as well: each instruction takes about 0.06 ns, i.e. the CPU executes about six sub eax, eax per clock cycle (4.3 GHz).
So, on a modern CPU, is there an instruction that produces zero but still depends on the instruction before it?
Use an AND with an immediate of zero:
and eax, 0
Both xor eax, eax and sub eax, eax are recognised as zeroing idioms, so the CPU treats them as independent of the previous value of eax; they won't do the trick.
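For illustration, here is a minimal sketch of how the AND-with-zero trick could be wrapped for a latency benchmark. It is not code from the question: the helper names dependent_zero and ns_per_dependent_zero are hypothetical, and the sketch assumes GCC or Clang extended inline assembly on x86 (the question uses MSVC's __asm syntax instead), with a plain C++ fallback on other targets. Whether a given CPU really keeps the dependency is what the timing is meant to show, not something the code guarantees.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the original post): returns 0, but through
// an instruction that reads its input, so the result depends on `v`.
inline std::uint32_t dependent_zero(std::uint32_t v)
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    // AND with an immediate 0; per the answer above this is not treated as a
    // zeroing idiom, so it should still wait for the previous value of v.
    __asm__ volatile("andl $0, %0" : "+r"(v) : : "cc");
#else
    // Fallback for other compilers/targets (assumption): a volatile mask keeps
    // the compiler from folding the AND into a constant.
    volatile std::uint32_t mask = 0;
    v &= mask;
#endif
    return v;
}

// Hypothetical helper: times a chain in which every step consumes the zero
// produced by the previous step, and returns nanoseconds per step.
inline double ns_per_dependent_zero(std::uint64_t iterations)
{
    using clock = std::chrono::steady_clock;
    std::uint32_t r = 12345;
    auto start = clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i)
        r = dependent_zero(r);          // each step depends on the last one
    double elapsed =
        std::chrono::duration<double, std::nano>(clock::now() - start).count();
    assert(r == 0);                     // the chain must end in zero
    return iterations ? elapsed / static_cast<double>(iterations) : 0.0;
}
```

Calling ns_per_dependent_zero(1'000'000'000) and comparing the result with the 0.06 ns measured for sub eax, eax should indicate whether the dependency chain is preserved. A value close to one clock cycle per step would suggest that it is, although loop overhead is included in this simple version.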