How do SYSCALL/SYSRET instructions perform across x86 CPUs?

Tags: performance, x86, kernel, system-calls

SYSCALL and SYSRET (and their 32-bit-only Intel counterparts SYSENTER and SYSEXIT) are usually described as a “generally faster” way to enter and exit supervisor mode in x86 processors than call gates or software interrupts, but the exact figures underlying this claim remain largely undocumented. In particular, none of the Intel or AMD optimization guides I was able to find mention these instructions at all. So:

How many cycles (estimated) do SYSCALL and SYSRET take across recent Intel 64 microarchitectures? This is probably measurable by direct experimentation, but there are quite a few different CPUs to test.

Depending on the order of magnitude of this number, more detailed questions may be relevant:

Do they incur a complete pipeline stall, or any other kind of stall?
How, if at all, do they interact with branch prediction (e.g. the return stack buffer) and fetch logic?
What about latencies, data dependencies, serialization?
&tc.

Assume 64-bit code on the userspace side, no additional address-space switches (writes to CR3) and even matching SYSCALL and SYSRET pairs if it matters.


asked Aug 07 '13 23:08

Alex Shpilkin

1 Answer

I was curious too, so I've written some basic bare-metal code to benchmark it: a loop that executes syscall 1,000,000 times, with a syscall handler that does nothing except execute sysret. On my Ryzen 7 3700X it averages 78 cycles for the call+return round trip.

Obviously that's an artificial benchmark, because a real system call handler will likely need to do more work, such as switching stacks and applying Spectre mitigations. But it gives an idea of the order of magnitude, which is less than the cost of a cache miss that goes to main memory.
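
For readers who want a rough comparison without writing bare-metal code, a user-space analogue on Linux can be sketched as below. This is not the code behind the 78-cycle figure above: it times a complete trip through the kernel for a cheap system call (`getpid`), so it includes stack switching, mitigations, and dispatch on top of the raw SYSCALL/SYSRET cost. The function names `read_ticks` and `average_syscall_ticks` are made up for this illustration, and the choice of `SYS_getpid` and the warm-up count are assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() on glibc */
#endif
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter. On most modern CPUs the TSC ticks at a
 * constant reference rate, so these are reference ticks, not necessarily
 * core clock cycles. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 builds (assumption): monotonic time in nanoseconds. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical helper: average ticks per getpid round trip.
 * The result includes the whole kernel entry/exit path and the loop
 * overhead, not just the SYSCALL and SYSRET instructions. */
double average_syscall_ticks(uint64_t iterations) {
    if (iterations == 0) {
        return 0.0;
    }

    /* Warm up caches and branch predictors before timing. */
    for (int i = 0; i < 1000; i++) {
        syscall(SYS_getpid);
    }

    uint64_t start = read_ticks();
    for (uint64_t i = 0; i < iterations; i++) {
        syscall(SYS_getpid);   /* always enters the kernel */
    }
    uint64_t end = read_ticks();

    return (double)(end - start) / (double)iterations;
}
```

Calling `average_syscall_ticks(1000000)` from `main` and printing the result gives a per-call average. Expect it to be noticeably higher than a bare SYSCALL+SYSRET pair, and to vary with the kernel version and the mitigations that are enabled.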


answered Sep 28 '22 13:09

Bruce Merry
