What does "rep; nop;" mean in x86 assembly? Is it the same as the "pause" instruction?

Motivation for this question

After some discussion in the comments of another question, I realized that I don't know what rep; nop; means in x86 (or x86-64) assembly, and I couldn't find a good explanation on the web.

I know that rep is a prefix that means "repeat the next instruction cx times" (or at least it was, in old 16-bit x86 assembly). According to this summary table at Wikipedia, it seems rep can only be used with movs, stos, cmps, lods, scas (but maybe this limitation was removed on newer processors). Thus, I would think rep nop (without the semicolon) would repeat a nop operation cx times.

However, after further searching, I got even more confused. It seems that rep; nop and pause map to exactly the same opcode, and pause behaves a bit differently from a plain nop. An old mail thread from 2005 said different things:

"try not to burn too much power"
"it is equivalent to 'nop' just with 2 byte encoding."
"it is magic on intel. Its like 'nop but let the other HT sibling run'"
"it is pause on intel and fast padding on Athlon"

With these different opinions, I couldn't understand the correct meaning.

It's used in the Linux kernel (on both i386 and x86_64), together with this comment: /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */ It is also used in BeRTOS, with the same comment.


asked Aug 16 '11 23:08

Denilson Sá Maia

2 Answers

rep; nop is indeed the same as the pause instruction (opcode F3 90). It may be written that way for assemblers that don't support the pause mnemonic yet. On older processors, this simply did nothing, just like nop but in two bytes. On newer processors that support hyperthreading, it serves as a hint that you are executing a spin loop, which lets the processor improve performance. From Intel's instruction reference:

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
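As a rough illustration of the recommendation above, here is a minimal sketch of a test-and-test-and-set spinlock in C11 that issues the hint on every iteration of its wait loop. The names cpu_relax, spinlock, spinlock_init, spinlock_trylock, spinlock_lock and spinlock_unlock are hypothetical helpers chosen for this example (cpu_relax only mirrors the spirit of the kernel comment quoted in the question), and the inline-assembly syntax assumes GCC or Clang. On other targets the sketch simply issues no hint.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: issue the spin-wait hint where the target has one. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* Assembles to F3 90: PAUSE on CPUs that know it, a plain NOP on older ones. */
    __asm__ __volatile__("rep; nop" ::: "memory");
#endif
    /* Assumption: on any other target this sketch does nothing here. */
}

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_bool locked;
} spinlock;

static inline void spinlock_init(spinlock *l)
{
    atomic_init(&l->locked, false);
}

/* Returns true if the lock was free and is now held by the caller. */
static inline bool spinlock_trylock(spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static inline void spinlock_lock(spinlock *l)
{
    while (!spinlock_trylock(l)) {
        /* Spin on a plain load until the lock looks free, hinting the CPU each time. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

static inline void spinlock_unlock(spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A caller would wrap its critical section in spinlock_lock(&l) and spinlock_unlock(&l). The inner loop only reads the flag, so the hint sits exactly in the spin-wait part the manual describes.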


answered Sep 23 '22 17:09

ughoavgfhw

rep nop = F3 90 = the encoding for pause, as well as how it decodes on older CPUs that don't support pause.

Prefixes (other than lock) that don't apply to an instruction are ignored in practice by existing CPUs.

The documentation says using rep with instructions it doesn't apply to is "reserved and can cause unpredictable behaviour" because future CPUs might recognize it as part of some new instruction. Once they establish any specific new instruction encoding using f3 xx, they document how it runs on older CPUs. (Yes, the x86 opcode space is so limited that they do crazy stuff like this, and yes it makes the decoders complicated.)

In this case, it means you can use pause in spinloops without breaking backwards compat. Old CPUs that don't know about pause will decode it as a NOP with no harm done, as guaranteed by Intel's ISA ref manual entry for pause. On new CPUs, you get the benefit of power-saving / HT friendliness, and avoiding memory-ordering mis-speculation when the memory you're spinning on does change and you leave the spin loop.

See the x86 tag wiki info page for links to Intel's manuals and tons of other good stuff.

Another case of a meaningless rep prefix becoming a new instruction on new CPUs: lzcnt is F3 0F BD /r. On CPUs that don't support that instruction (missing the LZCNT feature flag in their CPUID), it decodes as rep bsr, which runs the same as bsr. So on old CPUs, it produces the bit index of the highest set bit instead of the leading-zero count, i.e. 31 - expected_result for a 32-bit operand, and the result is undefined when the input was zero.
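To make the relationship between the two results concrete, here is a small sketch in portable C that models both instructions in software for 32-bit operands. bsr32_model and lzcnt32_model are hypothetical names for this illustration. They do not execute the real instructions, they only reproduce the documented results (the bit index of the highest set bit for bsr, the count of leading zero bits for lzcnt).

```c
#include <stdint.h>

/* Models BSR on a non-zero 32-bit input: index of the highest set bit (0..31).
 * The real instruction leaves the destination undefined for x == 0;
 * this model happens to return 0 in that case. */
static inline unsigned bsr32_model(uint32_t x)
{
    unsigned index = 0;
    while (x >>= 1)
        index++;
    return index;
}

/* Models LZCNT on a 32-bit input: number of leading zero bits (0..32). */
static inline unsigned lzcnt32_model(uint32_t x)
{
    unsigned count = 0;
    for (uint32_t mask = UINT32_C(1) << 31; mask != 0 && (x & mask) == 0; mask >>= 1)
        count++;
    return count;
}
```

For any non-zero x, lzcnt32_model(x) equals 31 - bsr32_model(x), which is why code that needs a real leading-zero count cannot silently fall back to the rep bsr decoding.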

In contrast, tzcnt (F3 0F BC /r, which old CPUs decode as rep bsf) and bsf produce the same result for non-zero inputs, so compilers can and do use tzcnt even when it's not guaranteed that the target CPU will run it as tzcnt. AMD CPUs have fast tzcnt, slow bsf, and on Intel they're both fast. As long as it doesn't matter for correctness (you're not relying on flag-setting, or on leaving the destination unmodified behaviour in the input=0 case), having it decode as tzcnt on CPUs that support it is helpful.

One case of a meaningless rep prefix that will probably never decode differently: rep ret is used by default by gcc when targeting "generic" CPUs (i.e. not targeting a specific CPU with -march or -mtune, and not targeting AMD K8 or K10). It will be decades before anyone could make a CPU that decodes rep ret as anything other than ret, because it's present in most binaries in most Linux distros. See What does `rep ret` mean?

answered Sep 24 '22 17:09

Peter Cordes


Tags: x86, assembly, x86-64, cpu, machine-code