 

Is placing code and read-only data it uses right next to each other a good idea?

While writing a lookup-table related answer for another question I was reminded of something that I always wonder about: is it smart to locate a small amount of non-code data needed by a function right next to the function, instead of the traditional approach of putting it in another section?

Let's say you have a small function, which uses a small, read-only, lookup table. The usual approach seems to be to locate the lookup table in a data section, such as .rodata which will generally place it at some distance from the text of the function itself.

For example, a simple function that calculates the parity of a byte, using a 16-entry LUT:

GLOBAL parity

SECTION .text
parity:
  mov   eax, edi                ; arg: byte in edi (SysV)
  shr   edi, 4
  xor   eax, edi                ; fold the high nibble into the low nibble
  and   eax, 15                 ; 4-bit index with the same parity as the byte
  movzx eax, byte [lut + eax]
  ret

SECTION .rodata
lut:
db 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0   ; parity of each 4-bit value

Now the function just happens to be about 16 bytes of code, and the lookup table is 16 bytes as well, so they could easily fit in the same cache line. This seems like a win-win: the LUT is always accessed by the function, and only by the function, so we potentially reduce the cost of calling this function when cold from 2 cache misses to 1, by putting the code and data side-by-side:

GLOBAL parity

SECTION .text
parity:
  mov   eax, edi                ; arg: byte in edi (SysV)
  shr   edi, 4
  xor   eax, edi                ; fold the high nibble into the low nibble
  and   eax, 15                 ; 4-bit index with the same parity as the byte
  movzx eax, byte [lut + eax]
  ret

lut:
db 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0   ; parity of each 4-bit value

It's the same as before, just with the table in the .text section immediately following the function¹.

As far as I know it is generally allowed in most architectures/executable formats, so what are the problems with it?

For example, can the instruction fetch mechanism of the CPU get confused by fetching beyond the ret in my example and trying to interpret the lookup table as (nonsense) instructions?


¹ Note that I deliberately put the table after the code since the code is needed first, but perhaps it doesn't matter in light of critical-word-first, and anyway the interaction is unclear.

asked Jan 17 '17 by BeeOnRope


2 Answers

Summary: I don't think this will be any worse than you'd predict (no special penalties due to this corner case), but it's still usually not worth doing because of split caches / TLBs and having few other benefits. To optimize for the cold-cache case, consider using immediate data (e.g. store the LUT to the stack before use, or there's a lot you can do with an immediate bitmap and a bt instruction).

Ideally, you can place your constants with others that are used by code that often runs before or after this code. Compilers can use profile-guided optimization to find "hot" data and cluster it together (at least Intel's VTune help suggests that this can help reduce overall dTLB miss rate).


Possible benefits: L2 cache hit for data loads, or at least DRAM-page locality if the function isn't tiny and the caches were cold to start with.


The main downside is cache/TLB efficiency. Data in code lines / pages is pollution of the L1I cache and iTLB, and code in data lines / pages is pollution of the L1D cache and dTLB.

The first rule of caching is that caches work. Code that runs frequently (and its data) will often be hot in cache. Code that doesn't run frequently is often not important for performance. Trying to optimize for the worst case this way could just end up making the best case less likely (more L1I / L1D misses and/or more TLB misses from including more lines/pages in both code and data footprints).


L2 and outer caches are unified, but L1 is split, and so are the L1 TLBs on any sane microarchitecture, for multiple reasons (physical proximity on chip to front-end or execution units, total number of read/write ports, etc.), but especially on x86 where there is near-zero overlap between code and data in compiler-generated code. All modern x86 designs also use an L2TLB that handles misses from either L1iTLB or L1dTLB. (Intel's optimization manual calls it the STLB, for Second level).

But unlike the L2 cache, I think Intel's STLB is a victim cache for both the iTLB and dTLB. (I don't remember where I read this, and can't find a source for it, though.) If my memory is correct, an L1TLB miss that hits in the STLB exchanges entries, so nothing is evicted or duplicated. On a miss in both levels, a page walk for the code fetch only loads the L1iTLB with the new entry. (I think the evicted entry goes into the STLB, and the LRU entry from that set in the STLB is evicted.)

Thus, if I'm right about TLB behaviour on Intel CPUs, the dTLB miss from movzx eax, byte [lut + eax] will also miss in the STLB (if caches were cold to start with), triggering another page walk even though the same page must already be hot in the iTLB, since we are executing code from it. At least the page-table entries will be hot in L1D cache, and in any internal page-walker caches.

It would be possible to test this behaviour with code that jumps from page to page, loading itself as data, e.g. by repeating a block like here: mov eax, [rel here] / jmp here+4096 / align 4096 (sketched below). Then look at perf counters for STLB misses from data loads (not code-fetch). This would make code/data locality in terms of 4k or 2M pages much less valuable than it otherwise would be, but still no worse than totally separate (except for the pollution issue: useful code could have occupied that space instead, reducing the total number of code pages touched).
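A hedged NASM sketch of that kind of test (untested; the label, symbol name, and repeat count are made up for illustration):

%macro PAGE_BLOCK 0
%%here:
  mov   eax, [rel %%here]   ; data load from the page we are currently executing
  jmp   %%here + 4096       ; continue at the same offset in the next page
  align 4096                ; pad out the rest of this 4 KiB page
%endmacro

SECTION .text
GLOBAL tlb_test
align 4096
tlb_test:
%rep 512                    ; touch 512 pages whose code is also loaded as data
  PAGE_BLOCK
%endrep
  ret                       ; the final jmp lands here, at the next page boundary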


If your function + data isn't all contained in the same cache line, a load early in the function could result in outstanding misses (to L2) for the same line from both L1D (demand load) and L1I (speculative code fetch). I don't know if there's any problem with that on any x86 uarches. I'd guess it's probably no worse than usual, and hopefully better than outstanding misses for two different lines. I'd guess that hardware doesn't trigger any kind of slow corner-case handling for this, but I haven't tested.

If the end of the function + data is on the next page, you could even get an iTLB and a dTLB miss in parallel for the same page, from a demand load + code-fetch.

However, I think data loaded from the same cache line as the current code will typically hit in L2 (although it's possible that it's still hot in L1I but evicted from L2, and possibly even from L3 on CPUs like Skylake-AVX512 where L3 isn't inclusive). This may sometimes be worth the inflation of the code and data cache working sets that comes from mixing both in the same line.


Non-x86:

ARM compilers (and assemblers for ldr r0, =constant pseudo-instructions) use literal-pools to load constants larger than 16 bits with small PC-relative displacements. I think these often end up on the same page, and sometimes the same cache-line, as code. Obviously ARM microarchitectures are designed to run such code efficiently (except for the wasted I-cache / D-cache space which is unavoidable). But presumably the code-size / instruction-count benefit is usually worth it. I'm not sure why this is common on ARM but not other RISC ISAs (at least I think it's not common on MIPS / PowerPC). Modern ARM has good support for creating arbitrary 32-bit constants with 2 instructions, and many bit-patterns can be created with a single instruction (using the barrel shifter with an immediate mov or mvn).

But there's no reason to expect that x86 microarchitectures take any special care to handle such cases any more efficiently than they would by default, because this pattern is not common in x86. It's not used by compilers, and the only RIP-relative addressing mode uses a rel32 displacement so there isn't even a code-size advantage from placing data very near code. Only the locality benefit for L3/L2 (and DRAM pages).

That doesn't mean we should expect it to be slow, only that we can't infer x86 behaviour from the fact that ARM CPUs need to support it efficiently. ARM CPUs that do use / support paging may have a TLB allocation policy that favours this pattern, like allocating an L2TLB entry on an iTLB miss. (If they use a multi-level TLB at all).


For example, can the instruction fetch mechanism of the CPU get confused by fetching beyond the ret in my example and trying to interpret the lookup table as (nonsense) instructions?

Speculative execution beyond a ret is usually not a problem, it seems.

I tried to test this once (with a fairly poor test that I didn't put much effort into), but I couldn't find any effect. Perhaps I didn't have a big enough test to defeat branch prediction for the ret, or speculative execution doesn't continue beyond ret the way it does for other indirect jumps. (Even if a call instruction is the first instruction of a function, and the callee is contiguous with that function, the correct return address is the instruction after the call, not the call itself. So speculative execution of instructions following ret can only be useful in cases where something put a fake return address on the stack.)

If the last instruction before your data was an indirect jmp, then it would make sense to worry about how the data decodes as instructions. You could block speculative execution by placing an int3 or ud2 after it, before your data (a sketch follows the quoted advice below), as recommended by Intel's optimization manual:

3.4.1.6 Branch Type Selection

blah blah fall-through is the fall-back default prediction for indirect jumps (but they don't mention ret). Bogus instructions can slow down branch recovery.

Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.

Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
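For instance, a minimal sketch of that pattern (hypothetical labels; non-PIC addressing like the question's example), with ud2 fencing off the fall-through path into the data:

SECTION .text
GLOBAL dispatch
dispatch:
  jmp   qword [table + rdi*8]   ; indirect branch; rdi = case index, assumed 0..3
  ud2                           ; stops the decoders going down the fall-through path
align 8
table:                          ; read-only jump table placed right after the code
  dq case0, case1, case2, case3
case0: mov eax, 0
  ret
case1: mov eax, 1
  ret
case2: mov eax, 2
  ret
case3: mov eax, 3
  ret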


Read-only access should be fine for the out-of-order pipeline itself. Only write accesses near (maybe within 2k or 4k of) EIP/RIP cause self-modifying-code machine-nukes / pipeline clears. (So obviously do not use this for non-const static data, which you normally can't anyway because code pages are normally mapped read/exec but not write.)


If your LUT is small enough, use it as immediate data instead of a load

If cold-cache performance is important, you can store your LUT to the stack with a couple of mov r64, imm64 / mov [m64], r64 instructions (or maybe mov r/m32, imm32), as sketched below.
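For the 16-byte parity LUT from the question, a hedged sketch might look like this (hypothetical function name; assumes a System V leaf function, so the red zone below rsp is usable):

GLOBAL parity_stack_lut

SECTION .text
parity_stack_lut:
  mov   rax, 0x0100000100010100   ; parities of 0..7, one byte per entry
  mov   rcx, 0x0001010001000001   ; parities of 8..15
  mov   [rsp-16], rax             ; build the LUT in the red zone
  mov   [rsp-8],  rcx
  mov   eax, edi
  shr   edi, 4
  xor   eax, edi                  ; same nibble fold as the original
  and   eax, 15
  movzx eax, byte [rsp-16 + rax]  ; index the on-stack LUT: no .rodata load
  ret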

Immediate bitmaps are great to set up for a bt instruction. As @Ross points out, you could do

mov   eax, 0x6996        ; bit i = parity(i), for the folded 4-bit index in edi
bt    eax, edi
setc  al

Or with the bitmap as an immediate operand to a test instruction:

xor   eax, eax
bts   eax, edi           ; 1U << arg

test  eax, 0x6996        ; the parity bitmap as an immediate
setnz al                 ; test clears CF, so check ZF instead of CF

Compilers will use this trick for switch when a lot of case-labels all run the same code, like in this case (see the output on Godbolt with gcc and clang).
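A hedged sketch (hypothetical C source and labels) of what that kind of output looks like: for switch (x) { case 0: case 2: case 5: return 1; default: return 0; } a compiler can test one immediate bitmap instead of building a jump table:

is_special:
  cmp   edi, 5              ; highest case value
  ja    .default
  mov   eax, 0x25           ; bits 0, 2 and 5 set: the case values that return 1
  bt    eax, edi
  setc  al
  movzx eax, al
  ret
.default:
  xor   eax, eax
  ret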

Another example: a vowel/consonant bitmap in a code-golf answer for classifying strings according to whether their vowel/consonant pattern is palindromic.


In truly hot functions, it is often better to load the table from memory instead of materializing it with mov-immediate, especially if that saves multiple mov instructions. But even saving one fused-domain uop by using a memory operand for an ALU instruction can be worth it, so there is a tradeoff between cold- and hot-cache performance. (But never use bt with a memory operand; its performance is garbage because of the crazy-CISC semantics of indexing a bit-string, instead of wrapping to the dword or qword selected by the addressing mode the way it wraps with a register destination.)
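To illustrate that difference (a sketch, not a benchmark; hypothetical symbol names): with a register destination the bit index simply wraps, but with a memory operand bt indexes an arbitrarily long bit-string, so the address it actually loads from depends on the register:

SECTION .bss
buf:  resd 8                    ; hypothetical 256-bit bit-string

SECTION .text
bit_test_demo:
  bt    eax, ecx                ; register form: tests bit (ecx & 31) of eax, one cheap uop
  bt    dword [buf], ecx        ; memory form: reads the dword at buf + 4*(ecx / 32),
                                ;   treating ecx as a signed bit index: much slower
  ret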


Or simply compute instead of using a LUT at all: test eax, eax / setp al, because parity (of the low byte only) is supported in hardware on x86. Other architectures with hardware popcnt could use that and take the low bit for even/odd parity.
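A minimal sketch of that computed version (assumes the byte arg in dil per the SysV convention; setnp matches the odd-parity convention of the question's LUT, setp would give even parity):

parity_hw:
  test  dil, dil       ; PF is set from the low 8 bits of the result
  setnp al             ; al = 1 if an odd number of bits are set
  movzx eax, al
  ret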

But for other problems, a LUT or a small vector constant can save a lot of work (maybe compressed for loading with a broadcast load like movddup or vpbroadcastd, or a per-element expansion like pmovsx/pmovzx).

answered by Peter Cordes


One thing that I can think of is that while it makes good use of the L2 and higher-level caches, it isn't as clear a win in the L1 data and instruction caches on architectures where they are split¹, since the line will appear in both the L1D and L1I caches, wasting some space in each (e.g., the space used to cache the lookup table in the L1I is wasted).


¹ I.e., pretty much all the main architectures today: almost everyone is modified Harvard...

answered by BeeOnRope