The following implementation of square produces a series of cmp/je instructions, as I would expect from a chained if statement:
    int square(int num) {
        if (num == 0)      { return 0; }
        else if (num == 1) { return 1; }
        else if (num == 2) { return 4; }
        else if (num == 3) { return 9; }
        else if (num == 4) { return 16; }
        else if (num == 5) { return 25; }
        else if (num == 6) { return 36; }
        else if (num == 7) { return 49; }
        else               { return num * num; }
    }
And the following produces a data table for the return value:
    int square_2(int num) {
        switch (num) {
            case 0:  return 0;
            case 1:  return 1;
            case 2:  return 4;
            case 3:  return 9;
            case 4:  return 16;
            case 5:  return 25;
            case 6:  return 36;
            case 7:  return 49;
            default: return num * num;
        }
    }
Why is gcc unable to optimize the top one into the bottom one?
Disassembly for reference: https://godbolt.org/z/UP_igi
EDIT: Interestingly, MSVC generates a jump table instead of a data table for the switch. And, surprisingly, clang optimizes both to the same result.
Speed: A switch statement can be faster than a chain of ifs when there are enough cases. With only a few cases it may make no measurable difference; as a rough rule of thumb, prefer switch when there are more than about five cases, otherwise if-else works just as well.

When there are many options, a switch statement is usually the better choice because it can execute considerably faster than an if-else chain, and deeply nested if-else statements are difficult to edit and maintain.

A switch statement has a fixed dispatch depth, which lets the compiler produce a well-optimized implementation that executes faster than an equivalent if-else-if chain, and programs that use switch statements tend to be easier to debug and maintain.

In programming languages, a switch statement is a type of selection control mechanism that allows the value of a variable or expression to change the control flow of program execution by mapping that value to a branch target.
The generated code for switch-case conventionally uses a jump table. In this case, the direct return through a look-up table seems to be an optimization making use of the fact that every case here involves a return. Though the standard makes no guarantees to that effect, I would be surprised if a compiler were to generate a series of compares instead of a jump table for a conventional switch-case.
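To make the shape of that lowering concrete, here is a minimal hand-written sketch of what the table-based version amounts to. The function name square_table_lookup and the array squares are made up for illustration; the compiler's actual output simply indexes a table of constants in read-only data.

    /* Sketch only: a bounds check plus one indexed load replaces the
     * whole cascade of cases. */
    static const int squares[8] = { 0, 1, 4, 9, 16, 25, 36, 49 };

    int square_table_lookup(int num) {
        if ((unsigned)num < 8u)        /* single range check covers cases 0..7 */
            return squares[num];
        return num * num;              /* the default case */
    }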
Now coming to if-else, it is the exact opposite. While switch-case executes in constant time, irrespective of the number of branches, if-else is optimized for a smaller number of branches. Here, you would expect the compiler to basically generate a series of comparisons in the order that you have written them.
So if I had used if-else because I expect most calls to square() to be for 0 or 1, and rarely for other values, then 'optimizing' this to a table lookup could actually cause my code to run slower than I expect, defeating my purpose for using an if instead of a switch. So although it is debatable, I feel GCC is doing the right thing and clang is being overly aggressive in its optimization.
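If that concern matters in practice, the honest answer is to measure it. Below is a minimal benchmark sketch; it assumes the square() and square_2() definitions from the question are compiled in a separate translation unit (so the calls cannot be constant-folded away), and the "mostly 0, occasionally 1" input distribution is just one assumed workload.

    #include <stdio.h>
    #include <time.h>

    int square(int num);    /* if-else version from the question, defined elsewhere */
    int square_2(int num);  /* switch version from the question, defined elsewhere  */

    int main(void) {
        volatile int sink = 0;            /* keeps the results observable */
        const long iters = 100000000L;

        clock_t t0 = clock();
        for (long i = 0; i < iters; i++)
            sink += square(i % 16 == 0);  /* input is mostly 0, occasionally 1 */
        clock_t t1 = clock();
        for (long i = 0; i < iters; i++)
            sink += square_2(i % 16 == 0);
        clock_t t2 = clock();

        printf("if-else: %.2fs  switch: %.2fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }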
Someone had, in the comments, shared a link where clang does this optimization and generates lookup-table based code for if-else as well. Something notable happens when we reduce the number of cases to just two (and a default) with clang. It once again generates identical code for both if and switch, but this time switches over to compares and moves instead of the lookup-table approach, for both. This means that even the switch-favoring clang knows that the 'if' pattern is more optimal when the number of cases is small!
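For reference, the reduced two-case experiment described above would look something like the following. This is a reconstruction for illustration, not the exact snippet from the linked comment.

    /* Two cases plus a default: with this shape, clang emits compares and
     * moves for both forms rather than a lookup table. */
    int square_if(int num) {
        if (num == 0)      return 0;
        else if (num == 1) return 1;
        else               return num * num;
    }

    int square_switch(int num) {
        switch (num) {
            case 0:  return 0;
            case 1:  return 1;
            default: return num * num;
        }
    }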
In summary, a sequence of compares for if-else and a jump table for switch-case is the standard pattern that compilers tend to follow and developers tend to expect when they write code. However, for certain special cases, some compilers might choose to break this pattern where they feel it provides better optimization. Other compilers might just choose to stick to the pattern anyway, even if apparently sub-optimal, trusting the developer to know what he wants. Both are valid approaches with their own advantages and disadvantages.
One possible rationale is that if low values of num are more likely, for example always 0, the generated code for the first one might be faster. The generated code for the switch takes equal time for all values.
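If the programmer really does know that num == 0 dominates, GCC and clang also accept an explicit hint via the __builtin_expect extension, rather than relying only on the order of the if-else chain. A minimal sketch, assuming that workload (the function name square_hinted is made up here):

    /* Sketch: tell the compiler that num == 0 is the expected common case.
     * __builtin_expect is a GCC/clang extension, not standard C. */
    int square_hinted(int num) {
        if (__builtin_expect(num == 0, 1))
            return 0;
        if (num == 1)
            return 1;
        return num * num;
    }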
Comparing the best cases according to this table (see this answer for the explanation of the table):
If num == 0, for "if" you have xor, test, je (with jump), ret. Latency: 1 + 1 + jump. However, xor and test are independent, so the actual execution time would be less than 1 + 1 cycles.
If num < 7, for "switch" you have mov, cmp, ja (without jump), mov, ret. Latency: 2 + 1 + no jump + 2.
A jump instruction that does not actually jump (a not-taken branch) is faster than one that does. However, the table does not define the latency of a jump, so it is not clear to me which one is better. It is possible that the latter is always better and GCC is simply not able to optimize it.