 

Why is gcc's right-shift code different in C and C++ mode?

When ARM gcc 9.2.1 is given command line options -O3 -xc++ -mcpu=cortex-m0 [compile as C++] and the following code:

unsigned short adjust(unsigned short *p)
{
    unsigned short temp = *p;
    temp -= temp>>15;
    return temp;
}

it produces the reasonable machine code:

    ldrh    r0, [r0]
    lsrs    r3, r0, #15
    subs    r0, r0, r3
    uxth    r0, r0
    bx      lr

which is equivalent to:

unsigned short adjust(unsigned short *p)
{
    unsigned r0,r3;
    r0 = *p;
    r3 = r0 >> 15;
    r0 -= r3;
    r0 &= 0xFFFFu;   // Returning an unsigned short requires...
    return r0;       //  computing a 32-bit unsigned value 0-65535.
}

Very reasonable. The last "uxth" could actually be omitted in this particular case, but it's better for a compiler that can't prove the safety of such optimizations to err on the side of caution rather than risk returning a value outside the range 0-65535, which could totally sink downstream code.

When using -O3 -xc -mcpu=cortex-m0 [identical options, except compiling as C rather than C++], however, the code changes:

    ldrh    r3, [r0]
    movs    r2, #0
    ldrsh   r0, [r0, r2]
    asrs    r0, r0, #15
    adds    r0, r0, r3
    uxth    r0, r0
    bx      lr

unsigned short adjust(unsigned short *p)
{
    unsigned r0,r2,r3;
    r3 = *p;
    r2 = 0;
    r0 = ((short *)p)[r2];  // ldrsh: redundant sign-extending load of *p
    r0 = ((int)r0) >> 15;   // Effectively computes -((*p)>>15)
    r0 += r3;
    r0 &= 0xFFFFu;     // Returning an unsigned short requires...
    return r0;         //  computing a 32-bit unsigned value 0-65535.
}

I know that the defined corner cases for left-shift are different in C and C++, but I thought right shifts were the same. Is there something different about the way right-shifts work in C and C++ that would cause the compiler to use different code to process them? Versions prior to 9.2.1 generate slightly less bad code in C mode:

    ldrh    r3, [r0]
    sxth    r0, r3
    asrs    r0, r0, #15
    adds    r0, r0, r3
    uxth    r0, r0
    bx      lr

equivalent to:

unsigned short adjust(unsigned short *p)
{
    unsigned r0,r3;
    r3 = *p;
    r0 = (short)r3;
    r0 = ((int)r0) >> 15; // Effectively computes -(temp>>15)
    r0 += r3;
    r0 &= 0xFFFFu;     // Returning an unsigned short requires...
    return r0;         //  computing a 32-bit unsigned value 0-65535.
}

Not as bad as the 9.2.1 version, but still an instruction longer than a straightforward translation of the code would have been. When using 9.2.1, declaring the argument as unsigned short volatile *p eliminates the redundant load of *p, but I'm curious why gcc 9.2.1 would need a volatile qualifier to avoid the redundant load, and why such a bizarre "optimization" only happens in C mode and not C++ mode. I'm also somewhat curious why gcc would even consider adding ((short)temp) >> 15 instead of subtracting temp >> 15. Is there some stage in the optimization pipeline where that would seem to make sense?

asked Jun 19 '20 by supercat

1 Answer

The difference appears to be due to a difference in integral promotion of temp between GCC's C and C++ compilation modes.

Using the "Tree/RTL Viewer" on Compiler Explorer, one can observe that when the code is compiled as C++, GCC promotes temp to an int for the right-shift operation. When compiled as C, however, temp is only converted to a signed short (seen on Godbolt):

GCC tree with -xc++:

{
  short unsigned int temp = *p;

  # DEBUG BEGIN STMT;
    short unsigned int temp = *p;
  # DEBUG BEGIN STMT;
  <<cleanup_point <<< Unknown tree: expr_stmt
  (void) (temp = temp - (short unsigned int) ((int) temp >> 15)) >>>>>;
  # DEBUG BEGIN STMT;
  return <retval> = temp;
}

with -xc:

{
  short unsigned int temp = *p;

  # DEBUG BEGIN STMT;
    short unsigned int temp = *p;
  # DEBUG BEGIN STMT;
  temp = (short unsigned int) ((signed short) temp >> 15) + temp;
  # DEBUG BEGIN STMT;
  return temp;
}

The cast to signed short is only made explicit when shifting temp by one bit less than its 16-bit size; when shifting by less than 15 bits, the cast disappears and the code compiles to match the "reasonable" instructions -xc++ produced. The unexpected behavior also occurs when using unsigned chars and shifting by 7 bits.

Interestingly, armv7-a clang does not produce the same behavior; both -xc and -xc++ produce a "reasonable" result:

    ldrh    r0, [r0]
    sxth    r0, r0
    lsrs    r1, r0, #15
    adds    r0, r1, r0
    uxth    r0, r0
    bx      lr

Update: It seems this "optimization" is triggered either by the literal 15, or by the use of subtraction (or unary -) together with the right-shift:

  • Placing the literal 15 in an unsigned short variable causes both -xc and -xc++ to produce reasonable instructions.
  • Replacing temp>>15 with temp/(1<<15) also causes both options to produce reasonable instructions.
  • Changing the shift to temp>>(-65521) causes both options to produce the longer arithmetic-shift version, with -xc++ also casting temp to signed short within the shift.
  • Moving the negative away from the shift operation (temp = -temp + temp>>15; return -temp;) causes both options to produce reasonable instructions.

See these examples on Godbolt. I would agree with @supercat that this may just be an odd case of the as-if rule. The takeaways I see from this are either to avoid unsigned subtraction with non-constants or, per this SO post about int promotion, to avoid forcing the arithmetic into smaller-than-int storage types.

answered Oct 14 '22 by clyne