
Why can I access lower dword/word/byte in a register but not higher?

Tags:

x86

assembly

x86-64

64-bit

cpu-registers

I have started learning assembly, and this does not look logical to me.

Why can't I use multiple higher bytes in a register?

I understand the historical reason of rax->eax->ax, so let's focus on new 64-bit registers. For example, I can use r8 and r8d, but why not r8dl and r8dh? The same goes with r8w and r8b.

My initial thinking was that I can use 8 r8b registers at the same time (like I can do with al and ah at the same time). But I can't. And using r8b makes the complete r8 register "busy".

Which raises the question - why? Why would you need to use only a part of a register if you can't use other parts at the same time? Why not just keep only r8 and forget about the lower parts?


asked Aug 04 '17 07:08

nikitablack

2 Answers

> why can't I use multiple higher bytes in a register

Every variant of an instruction needs its own encoding. The original 8086 processor supports the following options:

instruction     encoding    remarks
---------------------------------------------------------
mov ax,value    b8 01 00    <-- whole register
mov al,value    b0 01       <-- lower byte
mov ah,value    b4 01       <-- upper byte

Because the 8086 is a 16-bit processor, three different versions cover all options.
In the 80386, 32-bit support was added. The designers had a choice: either add support for 3 additional sets of registers (x 8 registers = 24 new registers) and somehow find encodings for these, or leave things mostly as they were before.

Here's what the designers opted for:

instruction     encoding           remarks
---------------------------------------------------------
mov eax,value    b8 01 00 00 00    (same encoding as mov ax,value!)
mov ax,value     66 b8 01 00       (prefix 66 + encoding for mov eax,value)
mov al,value     (same as before)
mov ah,value     (same as before)

They simply added a 0x66 prefix to change the operand size from the (now) default 32 bits to 16 bits, plus a 0x67 prefix to change the address size of memory operands. And left it at that.
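The encoding rule behind the two tables can be sketched in C. This is only an illustration, not an assembler: the function name `encode_mov_imm` and its parameters are made up for this example, it assumes a 32-bit code segment (default operand size 32), and it only covers `mov reg, imm` for the eight classic registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `mov reg, imm` assuming 32-bit mode.
 * reg  : 3-bit register number from the opcode byte
 *        (byte form: 0=al 1=cl 2=dl 3=bl 4=ah 5=ch 6=dh 7=bh)
 * bits : operand size, 8, 16 or 32
 * Returns the number of bytes written to out (at most 6). */
size_t encode_mov_imm(unsigned reg, unsigned bits, uint32_t imm, uint8_t *out)
{
    size_t n = 0;
    assert(reg < 8);
    assert(bits == 8 || bits == 16 || bits == 32);

    if (bits == 16)
        out[n++] = 0x66;                 /* operand-size prefix: 32 -> 16 */

    /* B0+reg is the byte form, B8+reg the word/dword form. */
    out[n++] = (uint8_t)((bits == 8 ? 0xB0 : 0xB8) + reg);

    /* The immediate follows in little-endian order. */
    for (unsigned i = 0; i < bits / 8; i++)
        out[n++] = (uint8_t)(imm >> (8 * i));

    return n;
}
```

The register number is only 3 bits wide, and the byte form already spends all eight values on `al..bh`. There is no spare bit left to say "bits 16-23 of eax", which is why a new partial register would need a new opcode or prefix.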

To do otherwise would have meant doubling the number of instruction encodings or adding ~~three~~ six new prefixes for each of your 'new' partial registers.
By the time the 80386 came out, nearly all one-byte opcodes were already taken, so there was little room left for new prefixes. This opcode space had been eaten up by useless instructions like AAA, AAD, AAM, AAS, DAA, DAS, SALC. (These have been disabled in x64 mode to free up much-needed encoding space.)

If you want to put a value into the higher bytes of a register, simply do:

movzx eax,cl     ; like mov al,cl, but faster; also zeroes the rest of eax
shl eax,24       ; move that byte from al to the high byte of eax

> But why not two (say r8dl and r8dh)

In the original 8086 there were 8 byte sized registers:

al,cl,dl,bl,ah,ch,dh,bh  <-- in this order.

The index registers, base pointer and stack reg do not have byte registers.

In x64 this was changed. If there is a REX prefix (which gives access to the x64 registers), then the encodings that used to select al..bh (8 regs) select al..r15l instead (r8l..r15l in Intel's naming are the same registers as r8b..r15b). That makes 16 regs, thanks to 1 extra encoding bit from the REX prefix. This adds spl, bpl, sil, dil, but excludes any xh reg. (You can still get the four xh regs when not using a REX prefix.)
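The mapping described above can be written down as a small lookup. This is a sketch in C with made-up names (`byte_reg_name`, `has_rex`, `rex_ext`); it only models which 8-bit register a 3-bit register field selects, not full instruction decoding.

```c
#include <assert.h>

/* Hypothetical decoder: name of the 8-bit register selected by a 3-bit
 * register field. has_rex says whether the instruction carries a REX
 * prefix; rex_ext is the REX extension bit that applies to this field
 * (REX.R or REX.B), and is ignored when there is no REX prefix. */
const char *byte_reg_name(int has_rex, unsigned rex_ext, unsigned field)
{
    static const char *const legacy[8] =
        { "al", "cl", "dl", "bl", "ah", "ch", "dh", "bh" };
    static const char *const rex_low[8] =
        { "al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil" };
    static const char *const rex_high[8] =
        { "r8b", "r9b", "r10b", "r11b", "r12b", "r13b", "r14b", "r15b" };

    assert(field < 8 && rex_ext < 2);
    if (!has_rex)
        return legacy[field];       /* ah/ch/dh/bh are reachable only here */
    return rex_ext ? rex_high[field] : rex_low[field];
}
```

Field values 4 to 7 are the interesting ones: without REX they name `ah..bh`, with REX they name the low bytes of `rsp`, `rbp`, `rsi` and `rdi`. The same bit patterns cannot mean both, so an instruction that needs a REX prefix cannot also use an `xh` register.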

> And using r8b makes the complete r8 "busy"

Yes, this is called a 'partial register write'. Because writing r8b changes part, but not all, of r8, r8 is now split into two pieces: one that has changed and one that has not. The CPU needs to join the two pieces. It can either do this by using an extra CPU cycle to perform the work, or by adding more circuitry so that it can be done in a single cycle.
The latter is expensive in terms of silicon and complex in terms of design; it also produces extra heat because of the extra work being done (more work per cycle = more heat produced). See "Why doesn't GCC use partial registers?" for a run-down on how different x86 CPUs handle partial-register writes (and later reads of the full register).

> if I use r8b I can't access upper 56 bits at the same time, they exist, but unaccessible

No, they are not inaccessible.

mov  rax,bignumber         ; arbitrary value in rax
mov  al,0                  ; clear al
xor  r8d,r8d               ; r8=0
mov  r8b,16                ; set r8b
or   r8,rax                ; change upper 56 bits of r8 without changing r8b

You use masks plus and, or, xor, and and-not (and with an inverted mask) to change parts of a register without affecting the rest of it.
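The masking idea generalizes to any byte of the register. Here is a minimal C sketch of it, treating a `uint64_t` as the 64-bit register; the helper names `set_byte` and `get_byte` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Replace byte number `index` (0 = lowest) of a 64-bit value and leave
 * the other seven bytes untouched: clear the field with AND and an
 * inverted mask, then OR in the new byte. */
uint64_t set_byte(uint64_t reg, unsigned index, uint8_t value)
{
    assert(index < 8);
    unsigned shift = 8 * index;
    uint64_t mask = (uint64_t)0xFF << shift;
    return (reg & ~mask) | ((uint64_t)value << shift);
}

/* Read byte number `index` back out with a shift and a truncation. */
uint8_t get_byte(uint64_t reg, unsigned index)
{
    assert(index < 8);
    return (uint8_t)(reg >> (8 * index));
}
```

A compiler turns each of these into a handful of shift, and, or instructions, which is the price of not having a register name for every byte.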

There really was never a need for ah, but it did lead to more compact code on 8086 (and effectively more usable registers). It's still sometimes useful to write EAX or RAX and then read AL and AH separately (e.g. movzx ecx, al / movzx edx, ah) as part of unpacking bytes.

answered Nov 09 '22 11:11

Johan

The general answer is that such access is costly in a few senses and rarely needed.

Since at least the second half of the 1980s, and even more so since the 1990s, instruction sets have been modelled mainly for compiler convenience rather than human convenience. A compiler's logic is much simpler when it maps a set of variables with defined sizes (8, 16, 32, 64 bits) onto a fixed set of registers, and each register holds exactly one value at a time. Register overlap is very confusing to compilers. As a result, a compiler internally knows a single register "A" (or even R0) that is AL, AX, EAX or RAX, depending on operand size. To use AH, it would have to take into account that AX consists of AH and AL, which is outside its normal view. Even if it generates instructions involving AH (e.g. LAHF), internally this is likely treated as an "operation that fills A with LowFlags*256". (In reality, there are some hacks that blur this clean picture, but they are very local.)

This is combined with other compiler specifics. For example, GCC and Clang are deeply SSA-based. As a result, you will almost never see an XCHG between two registers in their output (XCHG with memory does appear for atomic exchanges); if you find one somewhere in code, it is almost certainly hand-written assembly. The same goes for RCL and RCR, even though they are suitable in some specific cases (e.g. dividing a uint32 by 7). ROL and ROR mostly appear only where the compiler recognizes a rotate idiom in the source. If AMD had dropped RCL and RCR from the x86-64 design, nobody would really have mourned these instructions.

This does not include the vector facility, which is modelled on different principles and is orthogonal to the main one. When a compiler decides to do 4 parallel uint32 operations on an XMM register, it can use the PINSR* instructions to replace a part of such a register or PEXTR* to extract it, but in that case it tracks 2, 4, 8, 16... values at a time. Such vectorization doesn't apply to the main register set, at least in the main state-of-the-art ISAs.

This trend in compilers has been matched by an ongoing and strengthening trend in hardware. It's easier to make 16-32 independent architectural registers and track them individually (see register renaming), e.g. an add takes 2 register sources and provides 1 register result, than to track each part of a register separately and have an instruction that (for the same example) takes 16 single-byte sources and generates 8 single-byte results. (That's why x86-64 is designed so that a 32-bit register write clears the upper 32 bits of the 64-bit register; this is not done for 8- and 16-bit operations, because the CPU already had to merge with the upper bits of the previous register value, for legacy reasons.)
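The difference between the operand sizes can be modelled in a few lines of C. This is a sketch of the architectural result only (it says nothing about timing or renaming), and the function name `write_reg` is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of what a 64-bit register holds after a write of the given
 * operand size. `old` is the previous register value, `v` the value
 * being written. */
uint64_t write_reg(uint64_t old, uint64_t v, unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32 || bits == 64);
    switch (bits) {
    case 8:  /* e.g. mov r8b, x: upper 56 bits of old survive */
        return (old & ~(uint64_t)0xFF) | (v & 0xFF);
    case 16: /* e.g. mov r8w, x: upper 48 bits of old survive */
        return (old & ~(uint64_t)0xFFFF) | (v & 0xFFFF);
    case 32: /* e.g. mov r8d, x: upper half is zeroed, old is not used */
        return v & 0xFFFFFFFFu;
    default: /* full 64-bit write */
        return v;
    }
}
```

Only the 8-bit and 16-bit cases read `old`. That extra input is the dependency on the previous register value that the 32-bit and 64-bit forms avoid.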

There is some chance of seeing this change before a radical revolution in CPU design, but I consider it really minimal.

If you currently need access to part of a register, e.g. bits 40-47 of RAX, this can be implemented quite easily with copies, shifts and rotations. To extract it:

MOV RCX, RAX ; expect result in CL
SHR RCX, 40
MOVZX RCX, CL ; to clear all bits except 7-0

To replace value:

ROR RAX, 40
MOV AL, CL ; provided that CL is what to insert
ROL RAX, 40

These code chunks are linear and fast enough.
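For readers who want to check the two sequences without an assembler, here is a C model that follows them step by step. The function names and the rotate helpers are written for this example; a `uint64_t` stands in for RAX.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate helpers; n must be in 1..63 so that neither shift is by 64. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

static uint64_t rotl64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x << n) | (x >> (64 - n));
}

/* Extraction: MOV RCX, RAX / SHR RCX, 40 / MOVZX RCX, CL */
uint64_t extract_bits_40_47(uint64_t rax)
{
    uint64_t rcx = rax;      /* MOV RCX, RAX  */
    rcx >>= 40;              /* SHR RCX, 40   */
    return rcx & 0xFF;       /* MOVZX RCX, CL */
}

/* Replacement: ROR RAX, 40 / MOV AL, CL / ROL RAX, 40 */
uint64_t replace_bits_40_47(uint64_t rax, uint8_t cl)
{
    rax = rotr64(rax, 40);                 /* bits 40-47 are now in AL    */
    rax = (rax & ~(uint64_t)0xFF) | cl;    /* MOV AL, CL keeps bits 8-63  */
    return rotl64(rax, 40);                /* rotate everything back      */
}
```

The rotate version needs no mask constant in the assembly, because `MOV AL, CL` already does the merge into the low byte.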

answered Nov 09 '22 12:11

Netch
