Instruction Level Profiling: The Meaning of the Instruction Pointer?

Tags:

performance

assembly

profiling

x86-64

low-level

When profiling code at the assembly instruction level, what does the position of the instruction pointer really mean, given that modern CPUs don't execute instructions serially or in order? For example, assume the following x64 assembly code:

mov RAX, [RBX];         // Assume a cache miss here.
mov RSI, [RBX + RCX];   // Another cache miss.             
xor R8, R8;        
add RDX, RAX;           // Dependent on the load into RAX.
add RDI, RSI;           // Dependent on the load into RSI.

Which instruction will the instruction pointer spend most of its time on? I can think of good arguments for all of them:

- mov RAX, [RBX] probably takes hundreds of cycles because it's a cache miss.
- mov RSI, [RBX + RCX] also takes hundreds of cycles, but probably executes in parallel with the previous instruction. What does it even mean for the instruction pointer to be on one or the other of these?
- xor R8, R8 probably executes out of order and finishes before the memory loads finish, but the instruction pointer might stay here until all previous instructions have also finished.
- add RDX, RAX generates a pipeline stall because it's the instruction where the value of RAX is actually used after the slow cache-miss load into it.
- add RDI, RSI also stalls because it depends on the load into RSI.

asked Jun 09 '13 13:06

dsimcha

2 Answers

CPUs maintain a fiction that there are only the architectural registers (RAX, RBX, etc.) and that there is one specific instruction pointer (IP). Programmers and compilers target this fiction.

Yet as you noted, modern CPUs don't execute serially or in order. Until you, the programmer or user, request the IP, it behaves a bit like quantum physics: the IP is a wave of instructions in flight, all so that the processor can run the program as fast as possible. When you request the current IP (for example, via a debugger breakpoint or a profiler interrupt), the processor must recreate the fiction you expect, so it collapses this waveform (all "in-flight" instructions), gathers the register values back into their architectural names, and builds a context for executing the debugger routine, etc.

In this context, there is an IP that indicates the instruction where the processor should resume execution. During the out-of-order execution, this instruction was the oldest instruction yet to complete, even though at the time of the interrupt the processor was perhaps fetching instructions well past that point.

For example, the interrupt might report mov RSI, [RBX + RCX] as the IP even though the xor had already executed and completed. When the processor resumes execution after the interrupt, it will execute the xor again.
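
A toy model can make the "oldest instruction yet to complete" rule concrete. The sketch below is not a model of any real microarchitecture: the completion cycles are invented for illustration, and the function name `sample_attribution` is hypothetical. It assumes each instruction finishes executing at the given cycle, that instructions retire strictly in program order, and that a profiling interrupt reports the oldest instruction that has not yet retired.

```python
from collections import Counter

# Hypothetical cycle at which each instruction finishes executing.
# These numbers are made up for illustration, not measured on any CPU.
PROGRAM = [
    ("mov RAX, [RBX]", 200),        # cache miss
    ("mov RSI, [RBX + RCX]", 210),  # second miss, overlaps the first
    ("xor R8, R8", 2),              # finishes early, out of order
    ("add RDX, RAX", 201),          # waits for RAX
    ("add RDI, RSI", 211),          # waits for RSI
]


def sample_attribution(program):
    """Count, per instruction, the cycles in which it is the oldest unretired one."""
    samples = Counter()
    retire_cycle = 0  # cycle at which the previous instruction retired
    for name, done in program:
        # In-order retirement: an instruction cannot retire before older ones have.
        new_retire = max(retire_cycle, done)
        samples[name] = new_retire - retire_cycle
        retire_cycle = new_retire
    return samples


if __name__ == "__main__":
    for name, cycles in sample_attribution(PROGRAM).items():
        print(f"{name:24s} {cycles:4d}")
```

Under these assumptions almost all samples land on the first load, the second load is charged only for the extra cycles it needs beyond the first, and the xor is charged nothing even though it executed early. Real hardware is messier, so treat this as intuition rather than a prediction.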

answered Oct 18 '22 05:10

Brian

It's a good question, but in the kind of performance tuning I do, it doesn't really matter, because what you're looking for is speed-bugs. These are things the code is doing that take clock time and that could be done better or not at all. Examples:
- Spending I/O time looking in DLLs for resources that don't actually need to be looked for.
- Spending time in memory-allocation routines making and freeing objects that could simply be re-used.
- Re-calculating things in functions that could be memoized.
... these are just a few off the top of my head.
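
As a minimal sketch of the last kind of fix, the example below memoizes a lookup with Python's `functools.lru_cache`. The function `resource_path`, its return value, and the helper `load_many` are hypothetical stand-ins; a real lookup would do I/O where this one only formats a string.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def resource_path(name):
    """Hypothetical expensive lookup; the string formatting stands in for real I/O."""
    return f"/resources/{name}.dat"


def load_many(name, times):
    """Call the lookup repeatedly, as a hot loop might, and report cache statistics."""
    for _ in range(times):
        resource_path(name)
    return resource_path.cache_info()
```

Calling `load_many("icons", 1000)` on a fresh cache runs the function body once; the remaining calls are served from the cache, which `cache_info()` reports as hits.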

Your biggest enemy is a self-congratulatory tendency to say "I wouldn't consciously write any bugs. Why would I?" Of course, you know that's why you test software. But the same goes for speed-bugs, and if you don't know how to find those you assume there are none, which is a way of saying "My code has no possible speedups, except maybe a profiler can show me how to shave a few cycles."

In my half-century of experience, there is no code that, as first written, contains no speed-bugs. What's more, there's an enormous multiplier effect, where every speed-bug you remove makes the remaining ones more obvious. As a contrived example, suppose bug A accounts for 90% of clock time, and bug B accounts for 9%. If you only fix B, big deal - the code takes 91% of the original time, so it's about 10% faster. If you only fix A, that's good - it's 10x faster. But if you fix both, that's really good - it's 100x faster. Fixing A made B big: once A is gone, B accounts for 90% of what remains.
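
The arithmetic behind this contrived example can be checked with a few lines. This is only a sketch: the helper name `speedup` is made up here, and it assumes that fixing a bug removes its whole share of the original run time.

```python
def speedup(fractions_removed):
    """Speedup factor after removing the given fractions of the original run time."""
    remaining = 1.0 - sum(fractions_removed)
    return 1.0 / remaining


A, B = 0.90, 0.09  # shares of clock time from the contrived example

only_b = speedup([B])            # about 1.10x
only_a = speedup([A])            # about 10x
both = speedup([A, B])           # about 100x
b_share_after_a = B / (1.0 - A)  # B's share of the run time once A is gone, about 0.9
```

The last value shows the multiplier effect: B was 9% of the original run time, but after A is fixed it makes up roughly 90% of what is left.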

So the thing you need most in performance tuning is to find the speed-bugs, and not miss any. When you've done all that, then you can get down to cycle-shaving.

answered Oct 18 '22 06:10

Mike Dunlavey
