Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compiler generates costly MOVZX instruction

My profiler has identified the following function profiling as the hotspot.

typedef unsigned short ushort;

bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
    for( unsigned i = 0; i < dim; ++i )
    {
        if( point[i + 1] >= coord[i] ) return false;
    }

    return true;  
}

In particular one assembly instruction MOVZX (Move with Zero-Extend) is responsible for the bulk of the runtime. The if statement is compiled into

mov     rcx, QWORD PTR [rdi]
lea     r8d, [rax+1]
add     rsi, 2
movzx   r9d, WORD PTR [rsi-2]
mov     rax, r8
cmp     WORD PTR [rcx+r8*2], r9w
jae     .L5

I'd like to coax the compiler out of generating this instruction but I suppose I first need to understand why this instruction is generated. Why the widening/zero extension, considering that I'm working with the same data type?

(Find the entire function on godbolt compiler explorer.)

like image 929
Olumide Avatar asked Apr 19 '17 09:04

Olumide


People also ask

What does movzx do?

Description. movzx reads the contents of the register or effective address as a word or byte. movzx then sign-extends the 16- or 32-bit value to the operand-size attribute of the instruction. The result is stored in the destination register by movzx.

What is Al in assembly?

al and ah are the 8-bit, "char" size registers. al is the low 8 bits, ah is the high 8 bits. They're pretty similar to the old 8-bit registers of the 8008 back in 1972.


1 Answers

Thank you for the good question!

Clearing Registers and Dependency Breaking Idioms

A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:

Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

movzx vs mov

The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.

Contrary to the logic, a program with movzx (that fills the entire registers) actually works faster than with just mov, which only sets lower parts of the registers.

Let me demonstrate this conclusion to you on the following code fragment. It is part of the code that implements CRC-32 calculation using the Slicing by-N algorithm. Here it is:

    movzx   ecx, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 2]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]
    
    skipped 6 more similar triplets that do movzx, shr, xor.
    
    dec     <<<a counter register >>>>
    jnz     …… <<repeat the whole loop again>>>

Here is the second code fragment. We have cleared ecx in advance, and now just instead of “movzx ecx, bl” do “mov cl, bl”:

    // ecx is already cleared here to 0

    mov     cl, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    mov     cl, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 2]

    mov     cl, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]
    
    <<< and so on – as in the example #1>>>

Now guess which of the two above code fragments runs faster? Did you think previously that the speed is the same, or the movzx version is slower? In fact, the movzx code is faster because all the CPUs since Pentium Pro do Out-Of-Order execution of instructions and register renaming.

Register Renaming

Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.

Let me just take the first 4 instructions from the first code fragment:

  1.     movzx   ecx, bl
    
  2.     shr     ebx, 8
    
  3.     mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.     movzx   ecx, bl
    

As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.

So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 uses the register (read-only) modified by instruction 4, thus instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:

    movzx   ecx, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    movzx   edx, bl
    shr     ebx, 8
    xor     eax, dword ptr [edx * 4 + edi + 1024 * 2]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]

Here is what we have now:

  1.     movzx   ecx, bl
    
  2.     shr     ebx, 8
    
  3.     mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.     movzx   edx, bl
    

Now instruction 4 in no way uses any register needed for instruction 3, and vice versa, so instructions 3 and 4 can execute simultaneously for sure!

This is what the CPU does for us. The CPU, when translating instructions to micro-operations (micro-ops) which the Out-of-order algorithm will execute, renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers, rather than with the real ones as we know them. Thus we don't need to rename registers ourselves as I have just renamed in the above example – the CPU will automatically rename everything for us while translating instructions to micro-ops.

The micro-ops of instruction 3 and instruction 4 will be executed in parallel, since micro-ops of instruction 4 will deal with entirely different internal register (exposed to outside as ecx) than micro-ops of instruction 3, so we don't need to rename anything.

Let me revert the code to the initial version. Here it is:

  1.     movzx   ecx, bl
    
  2.     shr     ebx, 8
    
  3.     mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.     movzx   ecx, bl
    

(instructions 3 and 4 run in parallel because ecx of instruction 3 is not that ecx as of instruction 4, but a different, renamed register – the CPU has automatically allocated for instruction 4 micro-ops a new, fresh register from the pool of internally available registers).

Now let us go back to movxz vs mov.

Movzx clears a register entirely, so the CPU for sure knows that we do not depend on any previous value that remained in higher bits of the register. When the CPU sees the movxz instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:

  1.    mov     cl, bl
    
  2.    shr     ebx, 8
    
  3.    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.    mov     cl, bl
    

In this case, instruction 4, by modifying cl, modifies bits 0-7 of the ecx, leaving bits 8-32 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on bits 8-32 left from previous instructions. The CPU has to preserve bits 8-32 before it can execute instruction 4. Thus it cannot just rename the register. It will wait until instruction 3 completes before executing instruction 4. Instruction 4 didn't become fully independent - it depends on the previous value of ECX and the previous value of bl. So it depends on two registers at once. If we had used movzx, it would have depended on just one register - bl. Consequently, instructions 3 and 4 would not run in parallel because of their interdependence. Sad but true.

That's why it is always faster to operate complete registers. Suppose we need only to modify a part of the register. In that case, it's always quicker to alter the entire register (for example, use movzx) – to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and let the Out-of-order execution algorithm execute this instruction together with the other instructions, rather than execute them one-by-one.

like image 77
Maxim Masiutin Avatar answered Oct 28 '22 08:10

Maxim Masiutin