
Difference between memory fetch with and without offset on Intel

Appel explains in "Runtime Tags Aren't Necessary" on page 8 how to distinguish integers from pointers by tagging pointers:

Some implementations use a low-order tag of 0 for integers, then integer addition can then be done with the ordinary machine add instruction, and no shifting or correction will be necessary (since 2x + 2y = 2(x + y)). This requires that pointers have a tag of 1; but pointer-fetches can be done with odd offsets to compensate.

The idea is: if a pointer is aligned to 2 or 4 bytes, its value is a multiple of 2 or 4. In that case its lowest 1 or 2 bits are always zero, so they can hold a tag that distinguishes integers from pointers.
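To make the scheme concrete, here is a minimal sketch in C of the tagging described in the quote. The names `value_t`, `tag_int`, `untag_int`, `add_tagged`, `tag_ptr`, `is_ptr`, and `load_word` are made up for illustration and do not come from Appel's paper; the sketch also assumes pointers are at least 2-byte aligned and that integer/pointer casts round-trip, as they do on mainstream x86 compilers.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t value_t;  /* one tagged machine word (hypothetical name) */

/* Integers are stored as 2*n, so the low-order tag bit is 0. */
static value_t tag_int(intptr_t n) { return (value_t)n << 1; }

/* The stored word is always even, so dividing by 2 recovers n exactly. */
static intptr_t untag_int(value_t v) { return (intptr_t)v / 2; }

/* 2x + 2y = 2(x + y): a plain machine add, no shifting or correction. */
static value_t add_tagged(value_t a, value_t b) { return a + b; }

/* Pointers get a low-order tag of 1; this needs at least 2-byte alignment. */
static value_t tag_ptr(const void *p)
{
    assert(((uintptr_t)p & 1u) == 0);
    return (value_t)(uintptr_t)p | 1u;
}

static int is_ptr(value_t v) { return (int)(v & 1u); }

/* Fetch a 32-bit word through a tagged pointer. The -1 cancels the tag,
   and a compiler can fold it into the displacement of the load, which is
   the "odd offset" from the quote. */
static int32_t load_word(value_t v, intptr_t byte_offset)
{
    return *(const int32_t *)(v - 1 + (value_t)byte_offset);
}
```

With this layout, `load_word(p, 0)` corresponds to a fetch at displacement -1 from the tagged value, and `load_word(p, 4)` to a fetch at displacement 3.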

An untagged pointer fetch without offset in Intel syntax is:

mov    eax, DWORD PTR [ebx]

And the equivalent tagged pointer fetch with offset is this:

mov    eax, DWORD PTR [ebx-0x1] 

What is the difference in cycles for the two fetches?

ceving asked Jun 07 '26 20:06


1 Answer

The complexity of the addressing mode generally has no effect on the throughput of load instructions, but it can add 1 cycle to the latency [1].

In particular, a load with a simple addressing mode, meaning [base] or [base + offset] with offset < 2048, usually takes 4 cycles, while complex modes (anything that isn't simple) take 5 cycles. That's for loads into general-purpose registers; vector loads usually take 1 or 2 cycles more.

So in your case, you are using only a base register with a very small offset, so you should get the fastest load latency of 4 cycles.
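As a rough illustration only, the rule of thumb above can be written down as a toy model in C. The function names are hypothetical, the cycle counts are simply the ones quoted in this answer for L1 hits into general-purpose registers, and the condition `disp < 2048` is taken literally; how large negative displacements behave is not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* "Simple" per the rule above: [base] or [base + offset] with offset < 2048.
   Any mode that uses an index register counts as complex. */
static bool is_simple_mode(bool has_index, int32_t disp)
{
    return !has_index && disp < 2048;
}

/* Approximate L1-hit load-to-use latency for a general-purpose register
   load, using the numbers from the answer: 4 cycles simple, 5 complex. */
static int l1_load_latency_cycles(bool has_index, int32_t disp)
{
    return is_simple_mode(has_index, disp) ? 4 : 5;
}
```

Under this model both `[ebx]` and `[ebx-0x1]` come out at 4 cycles, which matches the conclusion for the tagged fetch.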

This applies to Intel; I'm not sure about AMD.

Details are in the Intel optimization guide, but here's the source I could find most quickly.

As Ross mentions in the comments, there is at least one more minor downside to using the offset: the instruction is one byte longer for the version with an offset (and would be 4 bytes longer if your offset is outside the range -128 to 127), which slightly increases pressure on the icache.
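The size difference can be checked by hand-encoding the instruction. The sketch below uses a hypothetical helper, `encode_mov_eax_ebx`, that handles only `mov eax, DWORD PTR [ebx + disp]` in 32-bit mode; other base registers such as `ebp` or `esp` have special encodings and are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `mov eax, DWORD PTR [ebx + disp]` into out[] and return the
   instruction length in bytes. Only the base register ebx is handled. */
static size_t encode_mov_eax_ebx(int32_t disp, uint8_t out[6])
{
    assert(out != NULL);
    out[0] = 0x8B;                    /* opcode: MOV r32, r/m32 */
    if (disp == 0) {
        out[1] = 0x03;                /* ModRM: mod=00, reg=eax, rm=ebx */
        return 2;
    }
    if (disp >= -128 && disp <= 127) {
        out[1] = 0x43;                /* ModRM: mod=01, 8-bit displacement */
        out[2] = (uint8_t)disp;
        return 3;
    }
    out[1] = 0x83;                    /* ModRM: mod=10, 32-bit displacement */
    uint32_t u = (uint32_t)disp;
    out[2] = (uint8_t)(u & 0xFF);     /* displacement is little-endian */
    out[3] = (uint8_t)((u >> 8) & 0xFF);
    out[4] = (uint8_t)((u >> 16) & 0xFF);
    out[5] = (uint8_t)((u >> 24) & 0xFF);
    return 6;
}
```

So the untagged fetch `[ebx]` is 2 bytes (`8B 03`), the tagged fetch `[ebx-0x1]` is 3 bytes (`8B 43 FF`), and a displacement outside -128 to 127 costs 6 bytes.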


[1] It goes without saying that this is for hits in L1. If you miss L1, latency will be longer, perhaps much longer, and it probably doesn't matter whether you still pay the extra cycle in that case (though I suppose you do, on average, since the miss doesn't get started until the address is calculated).

BeeOnRope answered Jun 10 '26 10:06