I am working through Kip Irvine's "Assembly Language for x86 Processors, sixth edition" and am really enjoying it.
I have just read about the NOP mnemonic in the following paragraph:
"It [NOP] is sometimes used by compilers and assemblers to align code to even-address boundaries."
The example given is:
    00000000 66 8B C3    mov ax, bx
    00000003 90          nop
    00000004 8B D1       mov edx, ecx
The book then states:
"x86 processors are designed to load code and data more quickly from even doubleword addresses."
My question is: is this because, for the x86 processors the book refers to (32-bit), the word size of the CPU is 32 bits, so it can pull in the instructions together with the NOP and process them in one go? If so, I assume a 64-bit processor with a word size of a quadword would do the same with a hypothetical 5 bytes of code plus a NOP?
Lastly, after I write my code, should I go through and correct the alignment with NOPs to optimize it, or will the assembler (MASM, in my case) do this for me, as the text seems to imply?
Thanks,
Scott
The ALIGN directive aligns the current location to a specified boundary by padding with zeros or NOP instructions.
Data that's aligned on a 16-bit boundary will have a memory address that's an even number, that is, a multiple of two. Each byte is 8 bits, so to align on a 16-bit boundary, you need to align to each set of two bytes.
x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.
A memory address a is said to be n-byte aligned when a is a multiple of n (where n is a power of 2). In this context, a byte is the smallest unit of memory access, i.e. each memory address specifies a different byte.
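The definition above translates directly into a bit test. Here is a minimal Python sketch, assuming n is a power of two. The helper names is_aligned and align_up are made up for illustration and are not part of any assembler or library.

```python
def is_aligned(addr: int, n: int) -> bool:
    """Return True when addr is a multiple of n (n must be a power of two)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    # A multiple of a power of two has all of its low log2(n) bits clear.
    return addr & (n - 1) == 0


def align_up(addr: int, n: int) -> int:
    """Round addr up to the next n-byte boundary (hypothetical helper)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    # Add n - 1, then clear the low bits to land on the boundary.
    return (addr + n - 1) & ~(n - 1)
```

With the listing from the question, is_aligned(3, 4) is false and align_up(3, 4) gives 4, which is the address where the instruction after the NOP starts.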
Code that's executed on word (for 8086) or DWORD (80386 and later) boundaries executes faster because the processor fetches whole (D)words. So if your instructions aren't aligned then there is a stall when loading.
However, you can't dword-align every instruction. Well, I guess you could, but then you'd be wasting space and the processor would have to execute the NOP instructions, which would kill any performance benefit of aligning the instructions.
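To put a rough number on the wasted space, here is a small Python sketch that lays out a run of instructions with every one padded to a boundary. Only the first two sizes (3 and 2 bytes) come from the listing in the question. The other sizes and the function name padding_cost are invented for illustration, and the sketch assumes single-byte NOPs.

```python
def padding_cost(sizes, boundary=4):
    """Return (code_bytes, nop_bytes) if every instruction starts on a boundary."""
    offset = 0
    nops = 0
    for size in sizes:
        pad = -offset % boundary   # NOP bytes needed to reach the next boundary
        nops += pad
        offset += pad + size
    return sum(sizes), nops


# 3- and 2-byte instructions as in the question, then two made-up sizes.
code, nops = padding_cost([3, 2, 1, 5])
```

For these sizes that is 6 bytes of NOPs for 11 bytes of real code, and every one of those NOPs sits on the straight-line path.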
In practice, aligning code on dword (or whatever) boundaries only helps when the instruction is the target of a branching instruction, and compilers typically will align the first instruction of a function, but won't align branch targets that can also be reached by fall through. For example:
    MyFunction:
        cmp ax, bx
        jnz NotEqual
        ; ... some code here
    NotEqual:
        ; ... more stuff here
A compiler that generates this code will typically align MyFunction because it is a branch target (reached by call), but it won't align NotEqual because doing so would insert NOP instructions that would have to be executed when falling through. That increases code size and makes the fall-through case slower.
I would suggest that, if you're just learning assembly language, you don't worry about things like this, which will most often give you only marginal performance gains. Just write your code to make things work. After it works, you can profile it and, if the profile data shows it's necessary, align your functions.
The assembler typically won't do it for you automatically.
Because the (16-bit) processor can fetch values from memory only at even addresses, due to the layout of its memory: the memory is divided into two "banks" of 1 byte each, so half of the data bus is connected to the first bank and the other half to the other bank. If these banks are lined up (as in my picture), the processor can fetch values that are on the same "row".
      bank 1   bank 2
    +--------+--------+
    | 8 bit  | 8 bit  |
    +--------+--------+
    |        |        |
    +--------+--------+
    |   4    |   5    | <-- the CPU can fetch only values on the same "row"
    +--------+--------+
    |   2    |   3    |
    +--------+--------+
    |   0    |   1    |
    +--------+--------+
        \        /
         \      /
          |    |
          |    |
          |    |
          |    |
      data bus (to uP)
Now, because of this fetch limitation, if the CPU is forced to fetch a value located at an odd address (say 3), it has to fetch the values at 2 and 3, then the values at 4 and 5, throw away values 2 and 5, and then join 4 and 3 (you are talking about x86, which has a little-endian memory layout).
That's why it is better to have code (and data!) on even addresses.
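The two-fetch case can be modelled in a few lines of Python. This is only a toy model of the two-bank picture above, not a description of any specific chip. The names fetch_row and read_word are hypothetical, and memory is just a bytes object with made-up contents.

```python
def fetch_row(mem: bytes, even_addr: int):
    """One bus cycle: return the two bytes of the row starting at an even address."""
    assert even_addr % 2 == 0
    return mem[even_addr], mem[even_addr + 1]


def read_word(mem: bytes, addr: int):
    """Read a little-endian 16-bit value; return (value, bus_cycles)."""
    if addr % 2 == 0:
        low, high = fetch_row(mem, addr)   # both bytes sit on one row
        return low | (high << 8), 1
    _, low = fetch_row(mem, addr - 1)      # keep the odd byte, drop the even one
    high, _ = fetch_row(mem, addr + 1)     # keep the even byte, drop the odd one
    return low | (high << 8), 2


# Made-up memory contents: the byte at address i holds 0x10 + i.
mem = bytes([0x10, 0x11, 0x12, 0x13, 0x14, 0x15])
```

In this model read_word(mem, 2) takes one cycle, while read_word(mem, 3) takes two and discards the bytes at addresses 2 and 5, as described above.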
PS: On 32 bit processors, code and data should be aligned on addresses which are divisible by 4 (since there are 4 banks).
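The same reasoning carries over to wider buses. Here is a hedged Python sketch, assuming one bank per byte of bus width and one access per row. The name rows_touched is a made-up helper, not a real API.

```python
def rows_touched(addr: int, size: int, banks: int = 4) -> int:
    """Count the bus rows that an access of `size` bytes at `addr` spans."""
    first_row = addr // banks
    last_row = (addr + size - 1) // banks
    return last_row - first_row + 1
```

With four banks, a doubleword at address 4 touches one row, while the same doubleword at address 3 touches two.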
Hope I was clear. :)