 

What does "nop dword ptr [rax+rax]" x64 assembly instruction do?


I'm trying to understand the x64 assembly optimization that is done by the compiler.

I compiled a small C++ project as Release build with Visual Studio 2008 SP1 IDE on Windows 8.1.

And one of the lines contained the following assembly code:

B8 31 00 00 00   mov         eax,31h
0F 1F 44 00 00   nop         dword ptr [rax+rax]


As far as I know, nop by itself does nothing, but I've never seen it with an operand like that.

Can someone explain what it does?

c00000fd asked May 16 '17 01:05



2 Answers

In a comment elsewhere on this page, Michael Petch points to a web page which describes the Intel x86 multi-byte NOP opcodes. The page has a table of useful information, but unfortunately the HTML is messed up so you can't read it. Here is some information from that page, plus that table presented in a readable form:

Multi-Byte NOP
http://www.felixcloutier.com/x86/NOP.html
The one-byte NOP instruction is an alias mnemonic for the XCHG (E)AX, (E)AX instruction.

The multi-byte NOP instruction performs no operation on supported processors and generates undefined opcode exception on processors that do not support the multi-byte NOP instruction.

The memory operand form of the instruction allows software to create a byte sequence of “no operation” as one instruction.

For situations where multiple-byte NOPs are needed, the recommended operations (32-bit mode and 64-bit mode) are: [my edit: in 64-bit mode, write rax instead of eax.]

Length    Assembly                                     Byte Sequence
-------   ------------------------------------------   --------------------------
1 byte    nop                                          90
2 bytes   66 nop                                       66 90
3 bytes   nop dword ptr [eax]                          0F 1F 00
4 bytes   nop dword ptr [eax + 00h]                    0F 1F 40 00
5 bytes   nop dword ptr [eax + eax*1 + 00h]            0F 1F 44 00 00
6 bytes   66 nop word ptr [eax + eax*1 + 00h]          66 0F 1F 44 00 00
7 bytes   nop dword ptr [eax + 00000000h]              0F 1F 80 00 00 00 00
8 bytes   nop dword ptr [eax + eax*1 + 00000000h]      0F 1F 84 00 00 00 00 00
9 bytes   66 nop word ptr [eax + eax*1 + 00000000h]    66 0F 1F 84 00 00 00 00 00


Note that the technique for selecting the right byte sequence--and thus the desired total size--may differ according to which assembler you are using.

For example, the following two lines of assembly taken from the table are ostensibly similar:

nop dword ptr [eax + 00h]
nop dword ptr [eax + 00000000h]

These differ only in the number of leading zeros, and some assemblers may make it hard to disable their "helpful" feature of always encoding the shortest possible byte sequence, which could make the second expression inaccessible.

For the multi-byte NOP situation, you don't want this "help" because you need to make sure that you actually get the desired number of bytes. So the issue is how to specify an exact combination of mod and r/m bits that ends up with the desired disp size--but via instruction mnemonics alone. This topic is complex, and certainly beyond the scope of my knowledge, but Scaled Indexing, MOD+R/M and SIB might be a starting place.
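To make the mod, r/m and SIB fields concrete, here is a minimal Python sketch that splits one of these NOP encodings into its parts. The function name decode_multibyte_nop is made up for illustration, and it only handles the 0F 1F memory forms from the table above (optional 66 prefix, no other prefixes), so treat it as a reading aid rather than a real disassembler.

```python
def decode_multibyte_nop(code: bytes) -> dict:
    """Split a 0F 1F multi-byte NOP into prefix, ModRM, SIB and disp fields."""
    i = 0
    prefix_66 = code[i] == 0x66            # optional operand-size prefix
    if prefix_66:
        i += 1
    if code[i:i + 2] != b"\x0f\x1f":
        raise ValueError("not a 0F 1F multi-byte NOP")
    i += 2

    modrm = code[i]
    i += 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:
        raise ValueError("register-direct form is outside this sketch")

    sib = None
    if rm == 4:                            # r/m = 100b: a SIB byte follows
        s = code[i]
        i += 1
        sib = {"scale": 1 << (s >> 6), "index": (s >> 3) & 7, "base": s & 7}

    # mod selects the displacement size: 00 -> none, 01 -> disp8, 10 -> disp32.
    disp_size = {0: 0, 1: 1, 2: 4}[mod]
    # Special cases with mod = 00: r/m = 101b or SIB base = 101b force a disp32.
    if mod == 0 and (rm == 5 or (sib is not None and sib["base"] == 5)):
        disp_size = 4

    return {"prefix_66": prefix_66, "mod": mod, "reg": reg, "rm": rm,
            "sib": sib, "disp_size": disp_size, "length": i + disp_size}


# The instruction from the question: 0F 1F 44 00 00
fields = decode_multibyte_nop(bytes.fromhex("0F 1F 44 00 00"))
print(fields)
```

For the bytes in the question, the ModRM byte 44 has mod = 01 (a disp8 follows) and r/m = 100 (a SIB byte follows), and the SIB byte 00 means base register 0, index register 0, scale 1, i.e. [rax+rax*1+00h] in 64-bit mode. The only difference between the 5-byte and 8-byte rows of the table is mod = 01 versus mod = 10, which is exactly the disp8 versus disp32 choice that an assembler may make for you.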

And, as I know you were just thinking, if you find it difficult or impossible to coerce your assembler into cooperating via instruction mnemonics, you can always resort to db ("define bytes") as a simple no-fuss alternative which is, um, guaranteed to work.
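As a rough illustration of the db route, the following Python sketch builds a padding run of any length out of the recommended sequences in the table and prints it as a db line. The names NOPS, nop_padding and as_db_line are invented for this example, the greedy "largest chunk first" split is just one reasonable choice, and the 0xxh hex spelling assumes a MASM/NASM-style assembler.

```python
# Recommended NOP sequences from the table, keyed by length in bytes.
NOPS = {
    1: "90",
    2: "66 90",
    3: "0F 1F 00",
    4: "0F 1F 40 00",
    5: "0F 1F 44 00 00",
    6: "66 0F 1F 44 00 00",
    7: "0F 1F 80 00 00 00 00",
    8: "0F 1F 84 00 00 00 00 00",
    9: "66 0F 1F 84 00 00 00 00 00",
}


def nop_padding(n: int) -> bytes:
    """Return exactly n bytes of padding, using the longest NOPs first."""
    if n < 0:
        raise ValueError("padding length must be non-negative")
    out = bytearray()
    while n > 0:
        chunk = min(n, 9)                  # longest sequence in the table
        out += bytes.fromhex(NOPS[chunk])
        n -= chunk
    return bytes(out)


def as_db_line(code: bytes) -> str:
    """Format bytes as a 'db' directive (assumed MASM/NASM-style hex)."""
    return "db " + ", ".join(f"0{b:02X}h" for b in code)


print(as_db_line(nop_padding(5)))
```

Because the bytes are spelled out literally, the assembler has no opportunity to pick a shorter encoding, so the length of the padding is exactly what you asked for.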

Glenn Slayden answered Sep 19 '22 17:09



As pointed out in the comments, it is a multi-byte NOP usually used to align the subsequent instruction to a 16-byte boundary, when that instruction is the first instruction in a loop.

Such alignment can help with instruction fetch bandwidth, because instruction fetch often happens in units of 16 bytes, so aligning the top of a loop gives the greatest chance that the decoding occurs without bottlenecks.
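The amount of padding needed is simple arithmetic on the address of the next instruction. Here is a small Python sketch of that calculation; padding_to_align is a made-up helper name, and the addresses are hypothetical, since the question does not show where the code was actually placed.

```python
def padding_to_align(address: int, alignment: int = 16) -> int:
    """Bytes of padding needed so that address + padding is aligned."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    # -address modulo alignment: 0 when already aligned.
    return -address & (alignment - 1)


# Hypothetical layout: a 5-byte 'mov eax,31h' at 0x1006 ends at 0x100B,
# so the loop top that follows needs 5 bytes of padding to reach 0x1010.
mov_address = 0x1006
mov_length = 5
print(padding_to_align(mov_address + mov_length))
```

With a layout like that one, a single 5-byte NOP such as the one in the question is enough to put the following instruction on a 16-byte boundary.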

Such alignment is arguably less important than it once was, now that the loop buffer and the uop cache, which are less sensitive to alignment, have been introduced. In some cases this optimization may even be a pessimization, especially when the loop executes very few times.

2 revs answered Sep 19 '22 17:09
