
Why can I access lower dword/word/byte in a register but not higher?

I started to learn assembler, and this does not look logical to me.

Why can't I use multiple higher bytes in a register?

I understand the historical reason of rax->eax->ax, so let's focus on new 64-bit registers. For example, I can use r8 and r8d, but why not r8dl and r8dh? The same goes with r8w and r8b.

My initial thinking was that I can use 8 r8b registers at the same time (like I can do with al and ah at the same time). But I can't. And using r8b makes the complete r8 register "busy".

Which raises the question - why? Why would you need to use only a part of a register if you can't use other parts at the same time? Why not just keep only r8 and forget about the lower parts?

nikitablack asked Aug 04 '17 07:08



2 Answers

why can't I use multiple higher bytes in a register

Every variant of an instruction needs its own encoding. The original 8086 processor supports the following options:

instruction     encoding    remarks
---------------------------------------------------------
mov ax,value    b8 01 00    <-- whole register
mov al,value    b0 01       <-- lower byte
mov ah,value    b4 01       <-- upper byte

Because the 8086 is a 16-bit processor, three different versions cover all options.
In the 80386, 32-bit support was added. The designers had a choice: either add support for 3 additional sets of registers (x 8 registers = 24 new registers) and somehow find encodings for these, or leave things mostly as they were before.

Here's what the designers opted for:

instruction     encoding           remarks
---------------------------------------------------------
mov eax,value    b8 01 00 00 00    (same encoding as mov ax,value!)
mov ax,value     66 b8 01 00       (prefix 66 + encoding for mov eax,value)
mov al,value     (same as before)
mov ah,value     (same as before)

They simply added a 0x66 prefix to change the operand size from the (now) default 32 bits to 16 bits, plus a 0x67 prefix to change the address size used for memory operands. And left it at that.

To do otherwise would have meant doubling the number of instruction encodings or adding several new prefixes, one for each of your 'new' partial registers.
By the time the 80386 came out nearly all one-byte opcodes were already taken, so there was hardly any space for new prefixes. This opcode space had been eaten up by rarely useful instructions like AAA, AAD, AAM, AAS, DAA, DAS and SALC. (These are invalid in 64-bit mode, to free up much-needed encoding space.)

If you want to put a byte into the high byte of a register, simply do:

movzx eax,cl     ; eax = zero-extended cl (like mov al,cl, but faster)
shl eax,24       ; move that byte to bits 24-31; the lower 24 bits are now zero

But why not two (say r8dl and r8dh)

In the original 8086 there were eight byte-sized registers:

al,cl,dl,bl,ah,ch,dh,bh  <-- in this order.

The index registers, base pointer and stack reg do not have byte registers.

In x64 this was changed. If an instruction has a REX prefix (needed to reach the new x64 registers), the eight encodings al..bh are reinterpreted: together with 1 extra encoding bit from the REX prefix they select one of 16 byte registers, al..r15b (also written r15l). This adds spl, bpl, sil and dil, but excludes any xh register (you can still get the four xh registers when not using a REX prefix).
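The mapping described above can be sketched as a small lookup in C. This is only an illustrative model, not how any assembler or CPU implements it: `byte_reg_name`, `has_rex` and `ext` are hypothetical names, where `ext` stands for the extra register bit supplied by the REX prefix.

```c
#include <assert.h>

/* Hypothetical helper: name of the 8-bit register selected by a 3-bit
   register field. `has_rex` says whether a REX prefix is present and
   `ext` is the extra register bit that prefix supplies. */
static const char *byte_reg_name(unsigned field, int has_rex, int ext)
{
    static const char *const legacy[8] =
        { "al", "cl", "dl", "bl", "ah", "ch", "dh", "bh" };
    static const char *const rex_low[8] =
        { "al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil" };
    static const char *const rex_high[8] =
        { "r8b", "r9b", "r10b", "r11b", "r12b", "r13b", "r14b", "r15b" };

    assert(field < 8);              /* the field is only 3 bits wide */
    if (!has_rex)
        return legacy[field];       /* no REX: ah..bh are reachable */
    return ext ? rex_high[field] : rex_low[field];
}
```

With field value 4, the model returns ah without a REX prefix, spl with a REX prefix and `ext` clear, and r12b with `ext` set, which is why ah and r12b can never appear in the same instruction.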

And using r8b makes the complete r8 "busy"

Yes, this is called a 'partial register write'. Because writing r8b changes part, but not all, of r8, r8 is now split into two pieces: one has changed and one has not. The CPU needs to join the two. It can either do this by using an extra CPU cycle to perform the work, or by adding more circuitry to be able to do it in a single cycle.
The latter is expensive in terms of silicon and complex in terms of design; it also adds extra heat because of the extra work being done (more work per cycle = more heat produced). See Why doesn't GCC use partial registers? for a run-down on how different x86 CPUs handle partial-register writes (and later reads of the full register).

if I use r8b I can't access upper 56 bits at the same time, they exist, but unaccessible

No, they are not inaccessible.

mov  rax,bignumber         ; some 64-bit value in rax
mov  al,0                  ; clear al
xor  r8d,r8d               ; r8=0
mov  r8b,16                ; set r8b
or   r8,rax                ; change the upper 56 bits of r8 without changing r8b

You use masks plus and, or, xor and and-not (not followed by and) to change parts of a register without affecting the rest of it.
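As a rough illustration of the masking idea, here is a C model that treats a 64-bit register as a `uint64_t`. The function name `set_byte` and the byte numbering (0 = lowest byte) are assumptions made for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Replace byte `index` (0 = lowest, 7 = highest) of a 64-bit value
   using only a mask plus and/or; the other seven bytes are untouched. */
static uint64_t set_byte(uint64_t reg, unsigned index, uint8_t value)
{
    assert(index < 8);
    unsigned shift = index * 8;
    uint64_t mask = (uint64_t)0xFF << shift;   /* selects the target byte */

    reg &= ~mask;                              /* and-not: clear that byte */
    reg |= (uint64_t)value << shift;           /* or: drop the new byte in */
    return reg;
}
```

In assembly the same thing costs a handful of instructions (load the mask, and, shift, or), which is the price of not having a dedicated name for every byte.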

There really was never a need for ah, but it did lead to more compact code on 8086 (and effectively more usable registers). It's still sometimes useful to write EAX or RAX and then read AL and AH separately (e.g. movzx ecx, al / movzx edx, ah) as part of unpacking bytes.

Johan answered Nov 09 '22 11:11


The general answer is that such access is costly in a few senses and rarely needed.

Since at least the second half of the 1980s, and even more so since the 1990s, instruction sets have been designed mainly for compiler convenience rather than human convenience. A compiler's logic is much simpler when it maps a set of variables of defined sizes (8, 16, 32, 64 bits) onto a fixed set of registers, with each register holding exactly one value at a time. Overlapping registers complicate that model. As a result, a compiler internally knows a single register "A" (or even R0) that appears as AL, AX, EAX or RAX depending on operand size. To use AH, it would have to take into account that AX consists of AH and AL, which falls outside that model. Even if it generates instructions involving AH (e.g. LAHF), internally this is likely treated as "an operation that fills A with LowFlags*256". (In reality, there are some hacks that blur this clean picture, but they are very local.)

This combines with other compiler specifics. For example, GCC and Clang are deeply SSA-based. As a result, you will almost never see a register-to-register XCHG in their output; if you find one, it is most likely hand-written inline assembly. The same goes for RCL and RCR, even though they are suitable in some specific cases (e.g. dividing a uint32 by 7), and, to a lesser extent, for ROL and ROR. If AMD had dropped RCL and RCR from the x86-64 design, hardly anybody would have mourned these instructions.

This does not include the vector facilities, which are modelled on different principles and are orthogonal to the main register set. When a compiler decides to do 4 parallel uint32 operations on an XMM register, it can use PINSR* instructions to replace part of such a register or PEXTR* to extract it, but in that case it tracks 2, 4, 8, 16... values at a time. Such vectorization doesn't apply to the main register set, at least in the main state-of-the-art ISAs.

This trend in compilers has been matched by an ongoing and growing trend in hardware. It's easier to make 16-32 independent architectural registers and track them individually (see register renaming), e.g. an add with 2 register sources and 1 register result, than to track each part of a register separately and treat the same instruction as one that takes 16 single-byte sources and generates 8 single-byte results. (That's why x86-64 is designed so that a 32-bit register write clears the upper 32 bits of the 64-bit register; this is not done for 8- and 16-bit operations, because the CPU already had to merge with the upper bits of the previous register value for legacy reasons.)
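The difference between the write widths can be modelled in a few lines of C. This is a sketch of the architectural result only, with `write_low` as a hypothetical name; it says nothing about how a real CPU renames or merges registers internally.

```c
#include <assert.h>
#include <stdint.h>

/* Model of what a 64-bit register holds after writing its low `width`
   bits with `v`, given that it previously held `old`. */
static uint64_t write_low(uint64_t old, uint64_t v, unsigned width)
{
    assert(width == 8 || width == 16 || width == 32 || width == 64);
    if (width == 64)
        return v;                          /* full write: old is irrelevant */
    if (width == 32)
        return v & 0xFFFFFFFFu;            /* upper 32 bits are zeroed */

    uint64_t mask = ((uint64_t)1 << width) - 1;   /* 0xFF or 0xFFFF */
    return (old & ~mask) | (v & mask);     /* 8/16-bit: merge with old bits */
}
```

Only the 8- and 16-bit cases read `old`, which is exactly the dependency on the previous register value that the 32-bit rule avoids.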

This could change at some point before a radical revolution in CPU design, but I consider the chances really minimal.

If you currently need access to part of a register, e.g. bits 40-47 of RAX, this can be implemented quite easily with copies, shifts and rotations. To extract it:

MOV RCX, RAX ; expect result in CL
SHR RCX, 40
MOVZX RCX, CL ; to clear all bits except 7-0

To replace value:

ROR RAX, 40
MOV AL, CL ; provided that CL is what to insert
ROL RAX, 40

These code sequences are straight-line and fast enough.
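For readers who want to check the two sequences without an assembler, here is a C model of them. The helper names are made up for this sketch, and the rotate helpers stand in for ROR and ROL on a 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for ROR/ROL on a 64-bit register (count must be 1..63 here). */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

static uint64_t rotl64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x << n) | (x >> (64 - n));
}

/* MOV RCX, RAX / SHR RCX, 40 / MOVZX RCX, CL */
static uint8_t extract_bits_40_47(uint64_t rax)
{
    uint64_t rcx = rax;
    rcx >>= 40;
    return (uint8_t)rcx;                   /* keep only bits 7-0 */
}

/* ROR RAX, 40 / MOV AL, CL / ROL RAX, 40 */
static uint64_t replace_bits_40_47(uint64_t rax, uint8_t cl)
{
    rax = rotr64(rax, 40);                 /* bring bits 40-47 down to AL */
    rax = (rax & ~(uint64_t)0xFF) | cl;    /* overwrite the low byte */
    return rotl64(rax, 40);                /* rotate everything back */
}
```

For example, with RAX = 0x1122334455667788 the extracted byte is 0x33, and replacing it with 0xAB gives 0x1122AB4455667788.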

Netch answered Nov 09 '22 12:11