
Any advantage of XOR AL,AL + MOVZX EAX, AL over XOR EAX,EAX?

Tags: c++, x86, assembly

I have some unknown C++ code that was compiled in Release build, so it's optimized. The point I'm struggling with is:

xor     al, al
add     esp, 8
cmp     byte ptr [ebp+userinput], 31h
movzx   eax, al

This is my understanding:

xor     al, al    ; set eax to 0x??????00 (clear the low byte)
add     esp, 8    ; for some unclear reason, set the stack pointer higher
cmp     byte ptr [ebp+userinput], 31h ; set zero flag if user input was "1"
movzx   eax, al   ; set eax to AL and extend with zeros, so eax = 0x000000??

I don't care about lines 2 and 3. They might be there in this order for pipelining reasons and IMHO have nothing to do with EAX.

However, I don't understand why I would clear AL first, just to clear the rest of EAX later. The result will IMHO always be EAX = 0, so this could also be

xor eax, eax

instead. What is the advantage or "optimization" of that piece of code?
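The claim that the sequence always leaves EAX at zero can be checked with a small model. This is only a sketch: the helper names are made up for illustration, and each function models nothing but the value an instruction writes to EAX (flags and timing are ignored).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers: each models only what the instruction writes to EAX.
constexpr std::uint32_t xor_al_al(std::uint32_t eax) {
    return eax & 0xFFFFFF00u;  // AL = 0, bits 8..31 unchanged
}

constexpr std::uint32_t movzx_eax_al(std::uint32_t eax) {
    return eax & 0x000000FFu;  // EAX = AL, zero-extended to 32 bits
}

constexpr std::uint32_t xor_eax_eax(std::uint32_t /*eax*/) {
    return 0u;                 // result does not depend on the old value
}

// The disassembled sequence. add esp and cmp are skipped because
// neither of them writes EAX.
constexpr std::uint32_t compiled_sequence(std::uint32_t eax) {
    return movzx_eax_al(xor_al_al(eax));
}

// Compile-time check for one arbitrary starting value.
static_assert(compiled_sequence(0xDEADBEEFu) == xor_eax_eax(0xDEADBEEFu),
              "both sequences should leave EAX == 0");

// Run-time spot check for a few more starting values.
inline void check_sequence_zeroes_eax() {
    const std::uint32_t samples[] = {0u, 0x31u, 0x12345678u, 0xFFFFFFFFu};
    for (std::uint32_t eax : samples) {
        assert(compiled_sequence(eax) == xor_eax_eax(eax));
    }
}
```

Calling `check_sequence_zeroes_eax()` from a test passes silently, which matches the reasoning above: whatever EAX held before, both variants end with the same value, so any difference has to be about performance, not the result.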

Some background info:

I will get the source code later. It's a short C++ console demo program, maybe 20 lines of code only, so nothing that I would call "complex" code. IDA shows a single loop in that program, but not around this piece. The Stud_PE signature scan didn't find anything, but likely it's Visual Studio 2013 or 2015 compiler.


asked Nov 08 '17 21:11

Thomas Weller

1 Answer

xor al,al is already slower than xor eax,eax on most CPUs. For example, on Haswell/Skylake it needs an ALU uop and doesn't break the dependency on the old value of eax/rax. It's equally bad on AMD CPUs and on Atom/Silvermont. (Well, maybe not equally, because AMD doesn't eliminate xor eax,eax at issue/rename. But xor al,al still has a false dependency there, which could serialize the new dependency chain with whatever used eax last.)

On CPUs that do rename al separately from the rest of the register (Intel pre-IvyBridge), xor al,al may still be recognized as a zeroing idiom. But unless you actively want to preserve the upper bytes of the register, the best way to zero al is xor eax,eax.

Doing movzx on top of that just makes it even worse.

I'm guessing your compiler somehow got confused and decided it needed a 1-byte zero, but then realized it needed to promote it to 32 bits. xor sets flags, so it couldn't xor-zero after the cmp, and it failed to notice that it could have just xor-zeroed eax before the cmp.
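To make the flags argument concrete, here is a toy model, not real CPU behaviour: it tracks only EAX and ZF, and the struct and function names are invented for this sketch. It shows why zeroing after the cmp would lose the comparison result, while zeroing before it (or using movzx, which doesn't write flags) keeps it.

```cpp
#include <cassert>
#include <cstdint>

// Toy register file: only EAX and the zero flag are modelled.
struct Cpu {
    std::uint32_t eax;
    bool zf;
};

// xor al, al: AL becomes 0; the result is zero, so ZF is set.
inline void xor_al_al(Cpu& c) { c.eax &= 0xFFFFFF00u; c.zf = true; }

// xor eax, eax: EAX becomes 0; the result is zero, so ZF is set.
inline void xor_eax_eax(Cpu& c) { c.eax = 0u; c.zf = true; }

// cmp lhs, rhs: ZF is set only when the operands are equal.
inline void cmp_byte(Cpu& c, std::uint8_t lhs, std::uint8_t rhs) { c.zf = (lhs == rhs); }

// movzx eax, al: writes EAX and leaves the flags alone.
inline void movzx_eax_al(Cpu& c) { c.eax &= 0x000000FFu; }

// The order seen in the disassembly (add esp, 8 omitted).
inline Cpu as_compiled(std::uint8_t userinput) {
    Cpu c{0xDEADBEEFu, false};  // arbitrary starting state
    xor_al_al(c);
    cmp_byte(c, userinput, 0x31);
    movzx_eax_al(c);
    return c;
}

// What the compiler could have emitted instead.
inline Cpu zero_before_cmp(std::uint8_t userinput) {
    Cpu c{0xDEADBEEFu, false};
    xor_eax_eax(c);
    cmp_byte(c, userinput, 0x31);
    return c;
}

// Not usable: the xor overwrites the ZF produced by cmp.
inline Cpu zero_after_cmp(std::uint8_t userinput) {
    Cpu c{0xDEADBEEFu, false};
    cmp_byte(c, userinput, 0x31);
    xor_eax_eax(c);
    return c;
}
```

In this model `as_compiled` and `zero_before_cmp` end in the same state for every input, while `zero_after_cmp` reports ZF set even when the input is not "1".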

Either that or it's something like Jester's suggestion, where the movzx is a branch target. Even if that's the case, xor eax,eax would still have been better because zero-extending into eax follows unconditionally on this code path.

I'm curious what compiler produced this from what source.


answered Sep 28 '22 07:09

Peter Cordes
