Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Initialising int affects function return value

Sorry for the vagueness of this question's title, but I'm not sure how to ask this exactly.

The following code, when executed on an Arduino microprocessor (c++ compiled for an ATMega328 microprocessor) works fine. Return values shows in comments in the code:

// Return the index of the first semicolon in a string
int detectSemicolon(const char* str) {

    int i = 0;

    Serial.print("i = ");
    Serial.println(i); // prints "i = 0"

    while (i <= strlen(str)) {
        if (str[i] == ';') {
            Serial.print("Found at i = ");
            Serial.println(i); // prints "Found at i = 2"
            return i;
        }
        i++;
    }

    Serial.println("Error"); // Does not execute
    return -999;
}

void main() {
    Serial.begin(250000);
    Serial.println(detectSemicolon("TE;ST")); // Prints "2"
}

This outputs "2" as the position of the first semicolon, as expected.

However, if I change the first line of the detectSemicolon function to int i; i.e. without the explicit initialisation, I get problems. Specifically, the output is "i = 0" (good), "Found at i = 2" (good), "-999" (bad!).

So the function is returning -999 despite having executed the print statement immediately before a return 2; line and despite never executing the print statement immediately before the return -999; line.

Can someone help me to understand what's happening here? I understand that variables inside functions in c can theoretically contain any old junk unless they're initialised, but here I'm specifically checking in a print statement that this hasn't happened, and yet...


EDIT: Thanks to everyone who's chipped in, and particularly to underscore_d for their great answer. It seems like undefined behaviour is indeed causing the compiler to just skip anything involving i. Here's some of the assembly with the serial.prints within detectSemicolon commented out:

void setup() {
    Serial.begin(250000);
    Serial.println(detectSemicolon("TE;ST")); // Prints "2"
  d0:   4a e0           ldi r20, 0x0A   ; 10
  d2:   50 e0           ldi r21, 0x00   ; 0
  d4:   69 e1           ldi r22, 0x19   ; 25
  d6:   7c ef           ldi r23, 0xFC   ; 252
  d8:   82 e2           ldi r24, 0x22   ; 34
  da:   91 e0           ldi r25, 0x01   ; 1
  dc:   0c 94 3d 03     jmp 0x67a   ; 0x67a <_ZN5Print7printlnEii>

It looks like the compiler is actually completely disregarding the while loop and concluding that the output will always be "-999", and so it doesn't even bother with a call to the function, instead hard coding 0xFC19. I'll have another look with the serial.prints enabled so that the function still gets called, but this is a strong pointer I think.


EDIT 2:

For those who really care, here's a link to the disassembled code exactly as shown above (in the UB case):

https://justpaste.it/vwu8

If you look carefully, the compiler seems to be designating register 28 as the location of i and "initialising" it to zero in line d8. This register gets treated as if it contains i throughout in the while loops, if statements etc, which is why the code appears to work and the print statements output as expected (e.g. line 122 where "i" gets incremented).

However, when it comes to returning this pseudo-variable, this is a step too far for our tried and tried-upon compiler; it draws the line, and dumps us to the other return statement (line 120 jumps to line 132, loading "-999" into registers 24 and 25 before returning to main()).

Or at least, that's as far as I can get with my limited grasp of assembly. Moral of the story is weird stuff happens when your code's behaviour is undefined.

like image 610
CharlieB Avatar asked Dec 24 '22 04:12

CharlieB


1 Answers

Like all basic types of non-static storage duration, declaring but not defining an int does not cause default initialisation. It leaves the variable uninitialised. That does not mean i just holds a random value. It holds no (known, valid) value, and therefore you're not allowed to read it yet.

Here's the relevant quote from the C++11 Standard, via Angew in the comments. This wasn't a new restriction, nor has it changed since then:

C++11 4.1/1, talking about an lvalue-to-rvalue conversion (basically reading a variable's value): "If the object to which the glvalue refers is ... uninitialized, a program that necessitates this conversion has undefined behavior."

Any read of an unitialised variable causes undefined behaviour, and so anything can happen. Rather than your program continuing to function as expected using some unknown default value, compilers can make it do absolutely anything, because the behaviour is undefined, and the Standard imposes no requirements on what should happen in such a scenario.

In practical terms, that usually means an optimising compiler might simply remove any code that relies in any way on UB. There's no way to make a correct decision about what to do, so it's perfectly valid to decide to do nothing (which just happens also to be an optimisation for size and often speed). Or as commenters have mentioned, it might keep the code but replace attempts to read i with the nearest unrelated value to hand, or with different constants in different statements, or etc.

Printing a variable doesn't count as 'checking it' as you think, so that makes no difference. There is no way to 'check' an uninitialised variable and thereby to inoculate yourself against UB. The behaviour of reading the variable is only defined if the program has already written a specific value to it.

There is no point in us speculating on why particular arbitrary types of UB occur: you just need to fix your code so that it operates deterministically.

Why do you want to use it uninitialised anyway? Is this just 'academic'?

like image 56
underscore_d Avatar answered Jan 09 '23 12:01

underscore_d