 

Inlining and Instruction Cache Hit Rates and Thrashing

In this article, https://www.geeksforgeeks.org/inline-functions-cpp/, it states that the disadvantages of inlining are:

3) Too much inlining can also reduce your instruction cache hit rate, thus reducing the speed of instruction fetch from that of cache memory to that of primary memory.

How does inlining affect the instruction cache hit rate?

6) Inline functions might cause thrashing because inlining might increase size of the binary executable file. Thrashing in memory causes performance of computer to degrade.

How does inlining increase the size of the binary executable file? Is it just that it increases the code base length? Moreover, it is not clear to me why having a larger binary executable file would cause thrashing, as the two don't seem linked.

asked Dec 06 '22 by Trajan


2 Answers

It is possible that the confusion about why inlining can hurt i-cache hit rate or cause thrashing lies in the difference between static instruction count and dynamic instruction count. Inlining (almost always) reduces the latter but often increases the former.

Let us briefly examine those concepts.

Static Instruction Count

Static instruction count for some execution trace is the number of unique instructions0 that appear in the binary image. Basically, you just count the instruction lines in an assembly dump. The following snippet of x86 code has a static instruction count of 5 (the .top: line is a label which doesn't translate to anything in the binary):

  mov ecx, 10
  mov eax, 0
.top:
  add eax, ecx
  dec ecx
  jnz .top

The static instruction count is mostly important for binary size and caching considerations.

Static instruction count may also be referred to simply as "code size" and I'll sometimes use that term below.

Dynamic Instruction Count

The dynamic instruction count, on the other hand, depends on the actual runtime behavior and is the number of instructions executed. The same static instruction can be counted multiple times due to loops and other branches, and some instructions included in the static count may never execute at all and so don't count in the dynamic case. The same snippet as above has a dynamic instruction count of 2 + 30 = 32: the first two instructions are executed once, and then the loop executes 10 times with 3 instructions each iteration.

As a very rough approximation, dynamic instruction count is primarily important for runtime performance.

The Tradeoff

Many optimizations such as loop unrolling, function cloning, vectorization and so on increase code size (static instruction count) in order to improve runtime performance (often strongly correlated with dynamic instruction count).

Inlining is also such an optimization, although with the twist that for some call sites inlining reduces both dynamic and static instruction count.

How does inlining affect the instruction cache hit rate?

The article mentioned too much inlining, and the basic idea here is that a lot of inlining increases the code footprint by increasing the working set's static instruction count while usually reducing its dynamic instruction count. Since a typical instruction cache1 caches static instructions, a larger static footprint means increased cache pressure and often results in a worse cache hit rate.

The increased static instruction count occurs because inlining essentially duplicates the function body at each call site. So rather than one copy of the function body and a few instructions to call the function at each of the N call sites, you end up with N copies of the function body.

Now, this is a rather naive model of how inlining works, since after inlining it may be the case that further optimizations can be done in the context of a particular call site, which may dramatically reduce the size of the inlined code. In the case of very small inlined functions, or a large amount of subsequent optimization, the resulting code may even be smaller after inlining, since the remaining code (if any) may be smaller than the overhead involved in calling the function2.

Still, the basic idea remains: too much inlining can bloat the code in the binary image.

The i-cache footprint for some execution depends on the static instruction count, or more specifically on the number of instruction cache lines touched in the binary image, which is largely a fairly direct function of the static instruction count. That is, the i-cache caches regions of the binary image, so the more regions there are and the larger they are, the larger the cache footprint, even if the dynamic instruction count happens to be lower.

How does inlining increase the size of the binary executable file?

It's exactly the same principle as the i-cache case above: each inlined copy of a function body adds instructions to the binary image, so a larger static footprint means more distinct pages need to be paged in, potentially putting more pressure on the VM system. Now, we usually measure code sizes in megabytes, while memory on servers, desktops, etc. is usually measured in gigabytes, so it's highly unlikely that excessive inlining meaningfully contributes to thrashing on such systems. It could perhaps be a concern on much smaller or embedded systems (although the latter often don't have an MMU at all).


0 Here unique refers, for example, to the IP of the instruction, not to the actual value of the encoded instruction. You might find inc eax in multiple places in the binary, but each is unique in this sense since it occurs at a different location.

1 There are exceptions, such as some types of trace caches.

2 On x86, the necessary overhead is pretty much just the call instruction. Depending on the call site, there may also be other overhead, such as shuffling values into the correct registers to adhere to the ABI, and spilling caller-saved registers. More generally, there may be a large cost to a function call simply because the compiler has to reset many of its assumptions across a function call, such as the state of memory.

answered Dec 31 '22 by BeeOnRope


Let's say you have a function that's 100 instructions long, and it takes 10 instructions to call it at each call site.

That means for 10 call sites it uses up 100 + 10 * 10 = 200 instructions in the binary.

Now let's say it's inlined everywhere it's used. That uses up 100 * 10 = 1000 instructions in your binary.

So for point 3, this means that it will take significantly more space in the instruction cache (different invocations of an inline function are not "shared" in the i-cache).

And for point 6, your total binary size is now bigger, and a bigger binary size can lead to thrashing.

answered Dec 31 '22 by Mike Vine