Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does MSVC emit a useless MOVSX before performing this Bit Test?

Compiling the following code in MSVC 2013, 64-bit release build, /O2 optimization:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

I get the following code - which has a really cool optimization using a 64-bit register as a lookup table with the bt (bit test) instruction.

    mov     rcx, 17596481020928             ; 0000100100002400H
    npad    5
$LL82@myFunc:
    movzx   eax, BYTE PTR [rsi]
    cmp     al, 44                          ; 0000002cH
    ja      SHORT $LN81@myFunc
    movsx   rax, al
    bt      rcx, rax
    jae     SHORT $LN81@myFunc
    inc     rsi
    jmp     SHORT $LL82@myFunc
$LN81@myFunc:
    ; code after loop...

But my question is: what is the purpose of the movsx rax, al after the first branch?

First we load a byte from the string into rax and zero-extend it:

movzx eax, BYTE PTR [rsi]

Then the cmp/ja pair performs an unsigned comparison between al and 44, and branches forwards if al is greater.

So now, we know 0 <= al <= 44 in unsigned numbers. Therefore, the highest bit of al could not possibly be set!

Nonetheless, the next instruction is movsx rax, al. This is a sign-extended move. But since:

  • al is the lowest byte of rax
  • we already know the other 7 bytes of rax are zeroed
  • we just proved that al's highest bit could not possibly be set

this movsx must be a no-op.

Why does MSVC do it? I'm assuming it's not for padding, since in that case another npad would make the meaning clearer. Is it flushing data dependencies or something?

(By the way, this bt optimization really makes me happy. Some interesting facts: it runs in 0.6x the time of the 4 cmp/je pairs you might expect, it's way faster than strspn or std::string::find_first_not_of, and it only happens in 64-bit builds even if the characters of interest have values under 32.)

like image 974
japreiss Avatar asked Sep 30 '14 15:09

japreiss


2 Answers

You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.

There are two basic codegen cases here. First one is the more universal one, where (charmax - charmin <= 64) but charmax >= 64. The optimizer needs to generate different code from what you saw, it needs to subtract charmin. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' by *s == 'A'.

Then there's the special case that you tested, all character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code, he made sure not to generate a silly SUB EAX,0 instruction. But overlooked that generating the MOVSX wasn't necessary. Surely missed by only checking for optimal code in the general case. And a general function call in the code, so easy to overlook, note how the instruction changes to MOVZX when you compile with /J. Otherwise easily deemed necessary, there is no BT instruction that takes an 8-bit register as the 2nd operand so the AL register load isn't enough by itself.

There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer. And decided to keep MOVSX to improve superscalar execution. I seriously doubt it exists.

like image 87
2 revs Avatar answered Oct 08 '22 17:10

2 revs


As Hans Passant already mentioned your code is a special case of the optimization. I did not look into the compiler/optimizer sources so I can only say what I expect to happen. However, here is my explanation for this.

Your code in C / C++:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

is equivalent to:

while (*s == 32 || *s == 44 || *s == 13 || *s == 12) {
    ++s;
}

or:

while (*s == 12 || *s == 13 || *s == 32 || *s == 44) {
   ++s;
}

The optimizer detects that there is an "if" with multiple (>=3 times) comparisons of the same character. Now, the optimizer generates as many 64bit patterns as needed (up to 4 patterns for an 8 bit char, 64*4 => all 256 possible values). Each of this bit patterns has an offset (= test-value to which the first bit in the pattern corresponds to) that needs to be subtracted from the value in test. In your case only one 64bit-pattern is needed for the values ranging from 12 to 44. So your code gets converted into some generic code like this:

#define ranged_bittest(pattern, value, min_val, max_val) \
  (val >= min_val) && (val <= max_val) && BT_with_offset(pattern, val, min_val)

while ( !ranged_bittest(<BITPATTERN>, *s, 12, 44) ) {
    ++s;
}

Now some other optimization takes ranged_bittest(<BITPATTERN>, *s, 12, 44); as it detects a bittest with start offset 12 and a pattern that can be safely shiftet left by 12 bits (as the upper 12 bits of pattern are zero). ranged_bittest(<BITPATTERN>, *s, 12, 44) => ranged_bittest(<BITPATTERN> << 12, *s, 0, 44)

This evolves to:

while (!(val >= 0 && val <= 44 && BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0))) {
    ++s;
}

The val >= 0 && gets optimized away (as it is assumed to be always true) and the "while" is translated to some "do"-loop with breaks; (which is better for branch prediction in most cases) resulting in:

do {
  if (val > 44) break;
  if (BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;

  ++s;
} while (true);

For any reason the optimization stops here (e.g. because there is a limit on how often further optimizations are applied to a code fragment for performance reasons).

Now in assembly the line

if (BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;`

is translated to something like:

MOV <reg64_A>, const_pattern
MOV <reg64_B>, value
SUB <reg64_B>, const_offset
BT <reg64_A>, <reg64_B>

The assembly optimizer reduces:

MOV <reg64_B>, value
SUB <reg64_B>, 0

to just

MOVSX <reg64_B>, value

in your special case this is:

MOVSX rax, al

The assembly optimizer (somehow) does not have enough information to completely eliminate the sign extension. Maybe the assembly optimizer is quite "dumb" (can't do any logical expression optimizations and stuff, just simple replacements) or the full optimization level isn't activated yet.

It can't therefore remove the movsx rax,al, as it doesn't have any meta information about the possible values of rax or al at this point of code.

I don't KNOW, if this is true, but this my best guess for this case...

like image 25
SDwarfs Avatar answered Oct 08 '22 16:10

SDwarfs