
How many 1-byte NOPs can Skylake execute in one cycle

I'm aligning branch targets with NOPs, and sometimes the CPU executes these NOPs, up to 15 NOPs. How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs? I want to know whether the extra code size and extra execution time of adding these NOPs are worth the price. It is not me adding these NOPs but the assembler, which inserts them automatically whenever I write an align directive.

Update: I have managed to get the assembler to insert multi-byte NOPs automatically.

Asked Jul 11 '17 by Maxim Masiutin



2 Answers

This is not me adding these NOPs but an assembler. It is pretty dumb and does not support options (BASM) for alignment; there is just one option: boundary size.

I don't know what "BASM" is, and I can't find any reference to it online (except this, which obviously isn't x86), but if it doesn't support multi-byte NOPs, you really need a different assembler. This is just really basic stuff that's been in the Intel and AMD architecture manuals for years. The GNU assembler can do this for ALIGN directives, and so can Microsoft's MASM. The open-source NASM and YASM assemblers support this as well, and either of these can be integrated into any existing build system easily.

By multi-byte NOPs, I mean the following, which you can find in AMD and Intel processor manuals:

Length   |  Mnemonic                                 |  Opcode Bytes
---------|-------------------------------------------|-------------------------------------
1 byte   |  NOP                                      |  90
2 bytes  |  66 NOP                                   |  66 90
3 bytes  |  NOP DWORD [EAX]                          |  0F 1F 00
4 bytes  |  NOP DWORD [EAX + 00H]                    |  0F 1F 40 00
5 bytes  |  NOP DWORD [EAX + EAX*1 + 00H]            |  0F 1F 44 00 00
6 bytes  |  66 NOP DWORD [EAX + EAX*1 + 00H]         |  66 0F 1F 44 00 00
7 bytes  |  NOP DWORD [EAX + 00000000H]              |  0F 1F 80 00 00 00 00
8 bytes  |  NOP DWORD [EAX + EAX*1 + 00000000H]      |  0F 1F 84 00 00 00 00 00
9 bytes  |  66 NOP DWORD [EAX + EAX*1 + 00000000H]   |  66 0F 1F 84 00 00 00 00 00

The sequence recommendations offered by the two manufacturers diverge slightly after 9 bytes, but NOPs that long are…not terribly common. And probably don't matter very much, since the extremely long NOP instructions with the excessive number of prefixes are going to degrade performance anyway. These work all the way back to the Pentium Pro, so they are universally supported today.
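To make the table concrete, here is a minimal Python sketch (not the algorithm of any particular assembler) of how an alignment gap can be filled with these multi-byte NOPs instead of a run of 1-byte 0x90s:

```python
# Multi-byte NOP encodings from the table above (Intel/AMD manuals).
NOPS = {
    1: bytes([0x90]),
    2: bytes([0x66, 0x90]),
    3: bytes([0x0F, 0x1F, 0x00]),
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}

def nop_pad(gap):
    """Fill `gap` bytes of padding with as few NOP instructions as
    possible, emitting the 9-byte form until fewer than 9 bytes remain."""
    out = b""
    while gap > 9:
        out += NOPS[9]
        gap -= 9
    if gap:
        out += NOPS[gap]
    return out

# A worst-case 15-byte gap becomes just two instructions:
# a 9-byte NOP followed by a 6-byte NOP.
print(nop_pad(15).hex(" "))
# → 66 0f 1f 84 00 00 00 00 00 66 0f 1f 44 00 00
```

This is the essence of what assemblers like NASM and gas do for an ALIGN directive: two instructions to decode instead of fifteen.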

Agner Fog has this to say about multi-byte NOPs:

The multi-byte NOP instruction has the opcode 0F 1F + a dummy memory operand. The length of the multi-byte NOP instruction can be adjusted by optionally adding 1 or 4 bytes of displacement and a SIB byte to the dummy memory operand and by adding one or more 66H prefixes. An excessive number of prefixes can cause delay on older microprocessors, but at least two prefixes is acceptable on most processors. NOPs of any length up to 10 bytes can be constructed in this way with no more than two prefixes. If the processor can handle multiple prefixes without penalty then the length can be up to 15 bytes.

All of the redundant/superfluous prefixes are simply ignored. The advantage, of course, is that many newer processors have lower decode rates for multi-byte NOPs, making them more efficient. They will be faster than a series of 1-byte NOP (0x90) instructions.

Perhaps even better than multi-byte NOPs for alignment is using longer forms of the instructions that you're already using in your code. These lengthier encodings don't take any longer to execute (they affect only decode bandwidth), so they're faster/cheaper than NOPs. Examples of this are:

  • Using the mod-reg-r/m byte forms of instructions like INC, DEC, PUSH, POP, etc., instead of the short versions
  • Using an equivalent instruction that is longer, like ADD instead of INC or LEA instead of MOV.
  • Encoding longer forms of immediate operands (e.g., 32-bit immediates instead of sign-extended 8-bit immediates)
  • Adding SIB bytes and/or unnecessary prefixes (e.g., operand-size, segment, and REX in long mode)

Agner Fog's manuals speak at length about and give examples of these techniques, as well.
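As an illustration of the first and third bullets, here is a small Python sketch listing some equivalent short and long encodings (byte values taken from the x86 opcode map, 32-bit mode; note that in 64-bit mode the one-byte form of INC does not exist, since 40h is a REX prefix there):

```python
# Pairs of (short encoding, longer equivalent) for the same instruction.
# The longer form absorbs padding bytes without any NOP being executed.
ENCODINGS = {
    "inc eax":   (bytes([0x40]),              # one-byte short form
                  bytes([0xFF, 0xC0])),       # FF /0 mod-reg-r/m form
    "push eax":  (bytes([0x50]),              # one-byte short form
                  bytes([0xFF, 0xF0])),       # FF /6 mod-reg-r/m form
    "add eax,1": (bytes([0x83, 0xC0, 0x01]),  # sign-extended imm8
                  bytes([0x81, 0xC0, 0x01, 0x00, 0x00, 0x00])),  # full imm32
}

for insn, (short, longer) in ENCODINGS.items():
    extra = len(longer) - len(short)
    print(f"{insn:<10} {len(short)} -> {len(longer)} bytes "
          f"(+{extra} byte(s) of padding)")
```

Both encodings in each pair perform identical work in the execution units; only the decoders see the difference.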

I don't know of any assemblers that will do these conversions/optimizations for you automatically (assemblers pick the shortest version, for obvious reasons), but they usually have a strict mode where you can force a particular encoding to be used, or you can just manually emit the instruction bytes. You only do this in highly performance-sensitive code anyway, where the work will actually pay off, so that limits the scope of the effort required substantially.

I want to know whether the extra code size and extra execution time of adding these NOPs are worth the price.

In general, no. While data alignment is extremely important and essentially free (size of the binary notwithstanding), code alignment is a lot less important. There are cases in tight loops where it can make a significant difference, but this only matters in hot spots in your code, which your profiler will already be identifying, and then you can perform the manipulations to manually align the code if necessary. Otherwise, I wouldn't worry about it.

It makes sense to align functions, as the padding bytes between them are never executed (rather than using NOPs here, you'll often see INT 3 or an invalid instruction, like UD2), but I wouldn't go around aligning all of your branch targets within functions simply as a matter of course. Only do it in known critical inner loops.

As ever, Agner Fog talks about this, and says it better than I could:

Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an important subroutine entry or jump label happens to be near the end of a 16-byte block then the microprocessor will only get a few useful bytes of code when fetching that block of code. It may have to fetch the next 16 bytes too before it can decode the first instructions after the label. This can be avoided by aligning important subroutine entries and loop entries by 16. Aligning by 8 will assure that at least 8 bytes of code can be loaded with the first instruction fetch, which may be sufficient if the instructions are small. We may align subroutine entries by the cache line size (typically 64 bytes) if the subroutine is part of a critical hot spot and the preceding code is unlikely to be executed in the same context.

A disadvantage of code alignment is that some cache space is lost to empty spaces before the aligned code entries.

In most cases, the effect of code alignment is minimal. So my recommendation is to align code only in the most critical cases like critical subroutines and critical innermost loops.

Aligning a subroutine entry is as simple as putting as many NOP's as needed before the subroutine entry to make the address divisible by 8, 16, 32 or 64, as desired. The assembler does this with the ALIGN directive. The NOP's that are inserted will not slow down the performance because they are never executed.

It is more problematic to align a loop entry because the preceding code is also executed. It may require up to 15 NOP's to align a loop entry by 16. These NOP's will be executed before the loop is entered and this will cost processor time. It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's. The best modern assemblers will do just that and use instructions like MOV EAX,EAX and LEA EBX,[EBX+00000000H] to fill the space before an ALIGN nn statement. The LEA instruction is particularly flexible. It is possible to give an instruction like LEA EBX,[EBX] any length from 2 to 8 by variously adding a SIB byte, a segment prefix and an offset of one or four bytes of zero. Don't use a two-byte offset in 32-bit mode as this will slow down decoding. And don't use more than one prefix because this will slow down decoding on older Intel processors.

Using pseudo-NOPs such as MOV RAX,RAX and LEA RBX,[RBX+0] as fillers has the disadvantage that it has a false dependence on the register, and it uses execution resources. It is better to use the multi-byte NOP instruction which can be adjusted to the desired length. The multi-byte NOP instruction is available in all processors that support conditional move instructions, i.e. Intel PPro, P2, AMD Athlon, K7 and later.

An alternative way of aligning a loop entry is to code the preceding instructions in ways that are longer than necessary. In most cases, this will not add to the execution time, but possibly to the instruction fetch time.
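A quick sanity check of the arithmetic in the quote above: the padding needed before a label is just its distance to the next boundary, which is how you get the "up to 15 NOPs to align by 16" worst case.

```python
# Bytes of padding required to bring `addr` up to the next multiple of
# `boundary` (0 if it is already aligned).
def pad_to(addr, boundary=16):
    return -addr % boundary

print(pad_to(0x401003))      # label 3 bytes past a boundary needs 13 bytes
print(pad_to(0x401001))      # worst case for align-by-16: 15 bytes
print(pad_to(0x401000))      # already aligned: 0 bytes
```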

He also goes on to show an example of another way to align an inner loop by moving the preceding subroutine entry. This is kind of awkward, and requires some manual adjustment even in the best of assemblers, but it may be the most effective mechanism. Again, this only matters in critical inner loops on the hot path, where you're probably already digging in and micro-optimizing anyway.

Anecdotally, I've benchmarked code that I was in the middle of optimizing several times, and didn't find very much if any benefit to aligning a loop branch target. For example, I was writing an optimized strlen function (the GNU libraries have one, but Microsoft's don't), and tried aligning the target of the main inner loop on 8-byte, 16-byte, and 32-byte boundaries. None of these made much of a difference, especially not when compared to the other drastic performance headway that I was making in rewriting the code.

And beware that if you're not optimizing for a specific processor, you can make yourself crazy trying to find the best "generic" code. When it comes to the effect of alignment on speed, things can vary wildly. A poor alignment strategy is often worse than no alignment strategy at all.

A power-of-two boundary is always a good idea, but this is pretty readily achieved without any extra effort. Again, don't dismiss alignment out of hand, because it can matter, but by the same token, don't obsess about trying to align every branch target.

Alignment used to be a bigger deal on the Core 2 (Merom/Penryn) and Nehalem microarchitectures, where substantial decode bottlenecks meant that, despite a 4-wide issue width, you had a hard time keeping the execution units busy. With the introduction of the µop cache in Sandy Bridge (one of the few nice features of the Pentium 4 that was ultimately reintroduced into the P6 extended family), front-end throughput was increased pretty significantly, and this became a lot less of a problem.

Frankly, compilers aren't very good at making these types of optimizations, either. The -O2 switch for GCC implies the -falign-functions, -falign-jumps, -falign-loops, and -falign-labels switches, with a default preference to align on 8-byte boundaries. This is a pretty blunt approach, and mileage varies. As I linked above, reports vary about whether disabling this alignment and going for compact code might actually increase performance. Moreover, about the best you're going to see a compiler doing is inserting multi-byte NOPs. I haven't seen one that uses longer forms of instructions or drastically rearranges code for alignment purposes. So we still have a long way to go, and it's a very difficult problem to solve. Some people are working on it, but that just goes to show how intractable the problem really is: "Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness." (Note that, while interesting, that paper hails from the early Core 2 days, which suffered more than most from misalignment penalties, as I mentioned earlier. I'm not sure if you'd see the same drastic improvements on today's microarchitectures, but I can't say for sure either way, because I haven't run the test. Maybe Google will hire me and I can publish another paper?)

How many 1-byte NOPs can Skylake execute in one cycle? What about other Intel-compatible processors, like AMD? I'm interested not only in Skylake but in other microarchitectures as well. How many cycles may it take to execute a sequence of 15 NOPs?

Questions like this can be answered by looking at Agner Fog's instruction tables and searching for NOP. I won't bother extracting all of his data into this answer.

In general, though, just know that NOPs are not free. Although they don't require an execution unit/port, they still have to run through the pipeline like any other instruction, and so they are ultimately bottlenecked by the issue (and/or retirement) width of the processor. This generally means you can execute somewhere between 3 and 5 NOPs per clock.
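As a back-of-the-envelope model (the issue width here is an assumption you'd look up per microarchitecture in Agner Fog's tables):

```python
import math

def nop_cycles(n_nops, issue_width=4):
    """Rough lower bound on cycles to issue a run of one-byte NOPs:
    they need no execution port, so the issue/retire width is the limit."""
    return math.ceil(n_nops / issue_width)

# The question's worst case of 15 NOPs on a nominally 4-wide core:
print(nop_cycles(15))        # → 4
```

So even the worst-case 15-byte pad costs only a handful of cycles each time it is executed; the real question is how often the fall-through path runs.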

NOPs also still take up space in the µop cache, which means reduced code density and cache efficiency.

In many ways, you can think of a NOP as being equivalent to an XOR reg, reg zeroing idiom or a MOV that gets eliminated in the front-end via register renaming.

Answered Sep 28 '22 by Cody Gray

Skylake can generally execute four single-byte nops in one cycle. This has been true at least back to the Sandy Bridge (hereafter SnB) micro-architecture.

Skylake, and others back to SnB, will also generally be able to execute four longer-than-one-byte nops in one cycle as well, unless they are so long as to run into front-end limitations.


The existing answers are much more complete and explain why you might not want to use such single-byte nop instructions so I won't add more, but it's nice to have one answer that just answers the headline question clearly, I think.

Answered Sep 28 '22 by BeeOnRope