Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

gcc strange behavior in certain array pre-/post-increment cases (edited)

Tags:

c

gcc

I'm writing a compiler that matches, within {} scope, by and large C99's semantics. When trying to reverse-engineer how gcc handles certain 'undefined behaviour', concretely, chained pre- and post-increments of variables, I noticed that it gets hopelessly confused if you combine this with modifying assignments (e.g., "*=") and array access. Simplifying to the easiest point of apparent utter confusion, gcc 4.6.3. evaluates (with and without option -std=c99):

a[0] = 2;
a[0] *= a[0]++;

to

a[0] = 3.

Am I mis-remembering the standard incorrectly? Is any use of a pre- or post-increment already undefined, not only chained use in a compound expression?

Also, even if the behaviour is 'undefined', the above seems like a particularly poor way of calculating the result as I could only see how you'd justify a result of 5 ( = 2*2 + 1, what I would have implemented - post-increment after an assignment statement), or 6 ( = 3 * 2, use a variable, then immediately post-increment it, and process in the order of parsing - the parser is almost certain to evaluate the "*=" after evaluating the RHS expression). Any insight into this - from a C or C++ angle?

I noticed this when trying to combine arrays with integer expression bounds with pre- and post-increments, and realize this is really hard; but still, the above seems a little bit like a cop-out considering the flagship status of gcc.

This is under Ubuntu 12.04.

Edit: I should have added that gcc's behavior can be reverse-engineered if the variable is not an array element - at least all examples I tried work as follows: (1) evaluate all compound expression pre-increments; (2) evaluate the expression; (3) evaluate all compound expression post-increments. So it probably has to do with the 'really hard' above as well.

Note: clang produces the philosophically reasonable value of 6. I ran more elaborate cases with clang, and am reasonably certain that it treats the array access and scalar case the same, and operates as in what I described above as the second philosophically reasonable way.

like image 928
gnometorule Avatar asked Feb 13 '23 20:02

gnometorule


2 Answers

The mutations in assignments (including read-and-mutate assignments such as *=) and in post-increments can happen at any time in the evaluation of the expression following the initial access of the value of the mutated cell. Consequently, the mutations to a[0] in *= and a[0]++ are not ordered with respect to each other. Apparently in this case gcc chooses to do them left-to-right.

Or to put it more precisely, the expression:

x *= y++;

can be rewritten as:

tmp1 = y + 1;        tmp2 = x * y;
x = tmp1;            y = tmp2;

where the columns can be interleaved in any order.

Note that in practice, both of the tmp variables are probably machine registers. In fact, what is probably going on is more like this:

r1  = y;
r2  = r1 + 1;
r3  = x;
r4  = r3 * r1;
x = r4;
y = r2;

So why do the last two assignments in that order rather than the other order? Well, why not?

Obviously, the confusion here is that x and y are the same location, but gcc is not obliged to notice that. It probably uses an optimization heuristic based on the assumption that they are different locations.

But suppose that it did notice that they are the same. In that case, one or other of the assignments can be eliminated. And furthermore, the computation of the temporary which is to be stored can also be eliminate. So the compiler would have the options of eliminating r4 = r3 * r1; x = r4 or r2 = r1 + 1; y = r2. With the choice of eliminating an increment or a multiplication, what would a self-respecting optimizer do?

like image 72
rici Avatar answered Feb 15 '23 10:02

rici


Am I mis-remembering the standard incorrectly? Is any use of a pre- or post-increment already undefined, not only chained use in a compound expression?

From the horse's mouth (2011 draft):

6.5 Expressions
...
2 If a side effect on a scalar object is unsequenced relative to either a different side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined. If there are multiple allowable orderings of the subexpressions of an expression, the behavior is undefined if such an unsequenced side effect occurs in any of the orderings.84)

84) This paragraph renders undefined statement expressions such as
  i = ++i + 1;
  a[i++] = i;
while allowing
  i = i + 1;
  a[i] = i;

Undefined means undefined; the compiler is not obligated to produce a meaningful or logical result (it can if it wants to, but it doesn't have to); it's not even required to reproduce the same result for each run. gcc does something, but it's likely that something is pretty random given all the freedom for ordering operations.

Note that a[0] *= a[1]++ would be well-defined (provided neither a[0] nor a[1] contain trap representations, anyway).

Edit

As to how you could get 3 out of all that, I have an idea...

  1. r1 <- a[0] (evaluate LHS)
  2. r2 <- a[0] (evaluate RHS)
  3. r1 <- r1 * r2 (perform multiplication)
  4. a[0] <- r1 (write result of multiplication to LHS)
  5. a[0] <- r2 + 1 (apply side effect to RHS)

Presto; you've clobbered the result of the multiplication because of how the side effect is applied. Whether gcc actually does something like that is an open issue, and either way it's irrelevant; undefined behavior doesn't have to be consistent or repeatable.

like image 35
John Bode Avatar answered Feb 15 '23 09:02

John Bode