Compiling the following code in MSVC 2013, 64-bit release build, /O2
optimization:
while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
++s;
}
I get the following code, which has a really cool optimization using a 64-bit register as a lookup table with the bt (bit test) instruction.
mov rcx, 17596481020928 ; 0000100100002400H
npad 5
$LL82@myFunc:
movzx eax, BYTE PTR [rsi]
cmp al, 44 ; 0000002cH
ja SHORT $LN81@myFunc
movsx rax, al
bt rcx, rax
jae SHORT $LN81@myFunc
inc rsi
jmp SHORT $LL82@myFunc
$LN81@myFunc:
; code after loop...
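To see how that magic constant encodes the character set, here's a quick standalone check of my own (not part of the compiler output): bit n of the mask is set exactly when character code n is one of the four separators.

// My own check, not from the compiler: build the mask bit by bit.
#include <cstdint>
#include <cstdio>

int main() {
    std::uint64_t mask = 0;
    for (unsigned char c : {' ', ',', '\r', '\n'})   // 32, 44, 13, 10
        mask |= std::uint64_t{1} << c;
    std::printf("%llu (0%llxH)\n",
                (unsigned long long)mask, (unsigned long long)mask);
    // Prints 17596481020928 (0100100002400H), matching the mov rcx operand.
}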
But my question is: what is the purpose of the movsx rax, al after the first branch?
First we load a byte from the string into eax, zero-extended (writing eax also clears the upper 32 bits of rax):
movzx eax, BYTE PTR [rsi]
Then the cmp/ja pair performs an unsigned comparison between al and 44, and branches forward if al is greater. So at this point we know 0 <= al <= 44 as unsigned numbers. Therefore, the highest bit of al could not possibly be set!
Nonetheless, the next instruction is movsx rax, al. This is a sign-extending move. But since:
- al is the lowest byte of rax,
- the upper bits of rax are already zeroed, and
- al's highest bit could not possibly be set,
this movsx must be a no-op.
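To make the no-op claim concrete, here's a little check of my own (not anything the compiler emits): sign extension and zero extension agree whenever the byte's top bit is clear, which the ja guarantees here.

// My own illustration: for byte values with the top bit clear, sign
// extension and zero extension produce the same 64-bit result.
#include <cstdint>
static_assert(static_cast<std::int64_t>(std::int8_t{44}) ==
              static_cast<std::int64_t>(std::uint8_t{44}),
              "movsx and movzx agree when the high bit is clear");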
Why does MSVC do it? I'm assuming it's not for padding, since in that case another npad would make the meaning clearer. Is it flushing data dependencies or something?
(By the way, this bt optimization really makes me happy. Some interesting facts: it runs in 0.6x the time of the four cmp/je pairs you might expect, it's way faster than strspn or std::string::find_first_not_of, and it only happens in 64-bit builds even if the characters of interest have values under 32.)
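For reference, a minimal sketch of the two library alternatives mentioned in that comparison (my code, not the benchmark itself):

#include <cstddef>
#include <cstring>
#include <string>

// The two library alternatives, on the same four-character set.
const char* skip_with_strspn(const char* s) {
    return s + std::strspn(s, " ,\r\n");      // advances past matching chars
}

std::size_t first_non_match(const std::string& s) {
    return s.find_first_not_of(" ,\r\n");     // std::string::npos if all match
}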
You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.
There are two basic codegen cases here. The first is the more universal one, where charmax - charmin <= 64 but charmax >= 64. The optimizer then needs to generate different code from what you saw: it needs to subtract charmin first. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' with *s == 'A'.
Then there's the special case that you tested, where all character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code; he made sure not to generate a silly SUB EAX,0 instruction, but overlooked that generating the MOVSX wasn't necessary either. Surely missed by only checking for optimal code in the general case. And it's a general function call in the code, so easy to overlook; note how the instruction changes to MOVZX when you compile with /J (which makes char unsigned by default). Otherwise it is easily deemed necessary: there is no BT instruction that takes an 8-bit register as the second operand, so loading AL isn't enough by itself.
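For example, the general-case loop he refers to would be:

// 'A' is 65, so charmax >= 64: the optimizer must now subtract charmin
// before the bit test, and the MOVSX disappears from the generated code.
while (*s == 'A' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}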
There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer, and that decided to keep the MOVSX to improve superscalar execution. I seriously doubt it exists.
As Hans Passant already mentioned, your code is a special case of the optimization. I did not look into the compiler/optimizer sources, so I can only say what I expect to happen. Still, here is my explanation.
Your code in C / C++:
while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
++s;
}
is equivalent to:
while (*s == 32 || *s == 44 || *s == 13 || *s == 10) {
++s;
}
or:
while (*s == 10 || *s == 13 || *s == 32 || *s == 44) {
++s;
}
The optimizer detects that there is an "if" with multiple (>= 3) comparisons of the same character. Now the optimizer generates as many 64-bit patterns as needed (up to 4 patterns for an 8-bit char; 4 * 64 covers all 256 possible values). Each of these bit patterns has an offset (the test value to which the first bit in the pattern corresponds), which needs to be subtracted from the value under test. In your case only one 64-bit pattern is needed, for the values ranging from 10 to 44. So your code gets converted into some generic code like this:
#define ranged_bittest(pattern, val, min_val, max_val) \
    ((val) >= (min_val) && (val) <= (max_val) && BT_with_offset((pattern), (val), (min_val)))
while (ranged_bittest(<BITPATTERN>, *s, 10, 44)) {
    ++s;
}
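BT_with_offset is this answer's invented helper, not a real intrinsic; in plain C++ its intended meaning would be something like:

#include <cstdint>

// Hypothetical helper matching the pseudo-macro above: test bit
// (val - min_val) of the 64-bit pattern, as the BT instruction would.
static bool BT_with_offset(std::uint64_t pattern, int val, int min_val) {
    return (pattern >> (val - min_val)) & 1u;
}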
Now another optimization takes ranged_bittest(<BITPATTERN>, *s, 10, 44), as it detects a bit test with start offset 10 and a pattern that can safely be shifted left by 10 bits (since the upper 10 bits of the pattern are zero):
ranged_bittest(<BITPATTERN>, *s, 10, 44)
=> ranged_bittest(<BITPATTERN> << 10, *s, 0, 44)
This evolves to:
while (*s >= 0 && *s <= 44 && BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) {
    ++s;
}
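Concretely, the rebasing step looks like this (my own illustration, not compiler code):

// A pattern whose bit 0 stands for character 10 can be rebased by shifting:
// afterwards bit n simply stands for character n, and the offset is 0.
std::uint64_t pat_offset10 = (1ULL << 0)  | (1ULL << 3)    // chars 10, 13
                           | (1ULL << 22) | (1ULL << 34);  // chars 32, 44
std::uint64_t pat_offset0  = pat_offset10 << 10;           // 0x100100002400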
The *s >= 0 test gets optimized away (as it is assumed to be always true), and the "while" is translated to a "do" loop with breaks (which is better for branch prediction in most cases), resulting in:
do {
    if (*s > 44) break;
    if (!BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;
    ++s;
} while (true);
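This already matches the shape of the assembly in the question; a direct C++ rendering (my sketch, with the actual mask constant) would be:

#include <cstdint>

// Sketch of the final loop, one line per emitted instruction.
void skip_separators(const char*& s) {
    const std::uint64_t mask = 0x100100002400ULL;  // mov rcx, 17596481020928
    for (;;) {
        unsigned char c = *s;                      // movzx eax, BYTE PTR [rsi]
        if (c > 44) break;                         // cmp al, 44 / ja
        if (!((mask >> c) & 1)) break;             // bt rcx, rax / jae
        ++s;                                       // inc rsi
    }
}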
For whatever reason, the optimization stops here (e.g. because there is a limit on how often further optimizations are applied to a code fragment, for compile-time performance reasons).
Now in assembly the line
if (!BT_with_offset(<SHIFTEDBITPATTERN>, *s, 0)) break;
is translated to something like:
MOV <reg64_A>, const_pattern   ; load the 64-bit lookup mask
MOV <reg64_B>, value           ; the character under test
SUB <reg64_B>, const_offset    ; rebase so bit 0 corresponds to charmin
BT  <reg64_A>, <reg64_B>       ; CF = the selected bit of the mask
The assembly optimizer reduces:
MOV <reg64_B>, value
SUB <reg64_B>, 0
to just
MOVSX <reg64_B>, value
In your special case, this is:
MOVSX rax, al
The assembly optimizer (somehow) does not have enough information to eliminate the sign extension completely. Maybe it is quite "dumb" (it can't do any logical-expression optimizations, just simple replacements), or the full optimization level isn't activated yet. It therefore can't remove the movsx rax, al, as it has no meta-information about the possible values of rax or al at this point in the code.
I don't KNOW if this is true, but it's my best guess for this case...