 

Does using `size_t` for lengths impact compiler optimizations?

While reading this question, I saw the first comment, which says:

size_t for length is not a great idea, the proper types are signed ones for optimization/UB reasons.

followed by another comment supporting the reasoning. Is it true?

The question matters because, if I were to write e.g. a matrix library, the image dimensions could be size_t, just to avoid checking whether they are negative. But then all loops would naturally use size_t. Could this impact optimization?
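
For concreteness, here is a minimal sketch of the kind of design I have in mind (the Matrix name and its members are purely illustrative): dimensions stored as size_t, so callers can't pass a meaningfully negative size, and loops naturally picking up size_t indices.

    #include <cstddef>
    #include <vector>

    // Illustrative only: dimensions are size_t, so there is nothing to
    // validate against "negative", and the loops use size_t as well.
    struct Matrix {
        std::size_t rows, cols;
        std::vector<double> data;

        Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

        void fill(double value) {
            for (std::size_t i = 0; i < rows; ++i)
                for (std::size_t j = 0; j < cols; ++j)
                    data[i * cols + j] = value;
        }
    };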

Asked Jan 02 '23 by Costantino Grana


1 Answer

size_t being unsigned is mostly a historical accident - if your world is 16 bit, going from a 32767-byte to a 65535-byte maximum object size is a big win; in current-day mainstream computing (where 64 and 32 bit are the norm) the fact that size_t is unsigned is mostly a nuisance.

Although unsigned types have less undefined behavior (as wraparound is guaranteed), the fact that they have mostly "bitfield" semantics is often a cause of bugs and other bad surprises; in particular:

  • the difference between unsigned values is unsigned as well, with the usual wraparound semantics, so if you may get a negative result you have to cast beforehand;

    unsigned a = 10, b = 20;
    // wraps around: prints 4294967286 (i.e. UINT_MAX - 9) if unsigned is 32 bit
    std::cout << a-b << "\n"; 
    
  • more generally, in signed/unsigned comparisons and arithmetic operations unsigned wins (the signed value is implicitly converted to unsigned), which, again, leads to surprises;

    unsigned a = 10;
    int b = -2;
    if(a < b) std::cout << "a < b\n"; // prints "a < b"
    
  • in common situations (e.g. iterating backwards) the unsigned semantics are often problematic, as you'd like the index to go negative for the boundary condition (a common workaround is sketched right after this list)

    // This works fine if T is signed, loops forever if T is unsigned
    for(T idx = c.size() - 1; idx >= 0; idx--) {
        // ...
    }
    

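A common workaround when the index must stay unsigned (not part of the quoted loop above, just a frequently used idiom; the print_backwards wrapper is only there to make the snippet self-contained) is to decrement before the body, so the index never has to go negative:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Iterate backwards with an unsigned index: `idx` starts at c.size() and
    // is decremented *before* the body runs, so it never needs to be negative.
    // Also safe for empty containers, since `idx-- > 0` is false immediately.
    void print_backwards(const std::vector<int>& c) {
        for (std::size_t idx = c.size(); idx-- > 0; ) {
            std::cout << c[idx] << "\n";
        }
    }
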
Also, the fact that an unsigned value cannot assume a negative value is mostly a strawman; you may avoid checking for negative values, but due to implicit signed-unsigned conversions it won't stop any error - you are just shifting the blame. If the user passes a negative value to your library function taking a size_t, it will just become a very big number, which will be just as wrong if not worse.

int sum_arr(int *arr, unsigned len) {
    int ret = 0;
    for(unsigned i = 0; i < len; ++i) {
        ret += arr[i];
    }
    return ret;
}

// compiles successfully and reads way past the end of the array; if len was
// signed, it would just return 0
sum_arr(some_array, -10);
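
For contrast, here is a sketch (not from the original answer; the sum_arr_signed name is mine) of the signed variant, where the bad call is at least detectable up front, because the negative value stays negative instead of silently becoming a huge count:

    #include <cassert>

    // Signed length: a negative value can be rejected explicitly instead of
    // being reinterpreted as a very large unsigned count.
    int sum_arr_signed(const int *arr, long long len) {
        assert(len >= 0 && "negative length passed to sum_arr_signed");
        int ret = 0;
        for (long long i = 0; i < len; ++i) {
            ret += arr[i];
        }
        return ret;
    }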

For the optimization part: the advantages of signed types in this regard are overrated; yes, the compiler can assume that signed overflow never happens, so it can be extra smart in some situations, but generally this won't be game-changing (wraparound semantics comes "for free" on current-day architectures). Most importantly, as usual, if your profiler finds that a particular spot is a bottleneck, you can modify just that spot to make it go faster (including switching types locally to get the compiler to generate better code, if you find it advantageous).
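
To make the "extra smart" part concrete, here is a minimal illustration (the function names are mine): because signed overflow is undefined behavior, the compiler is allowed to fold the signed comparison to a constant, while the unsigned one must honour wraparound.

    #include <climits>

    // Signed overflow is UB, so the compiler may assume `x + 1 > x` always
    // holds and reduce this function to `return true;`.
    bool always_greater_signed(int x) {
        return x + 1 > x;
    }

    // Unsigned arithmetic wraps by definition: `x + 1 > x` is false when
    // x == UINT_MAX, so the comparison has to be evaluated for real.
    bool always_greater_unsigned(unsigned x) {
        return x + 1 > x;
    }

Whether such assumptions translate into measurable speedups in real loops depends heavily on the surrounding code, which is exactly why profiling first is the sensible approach.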

Long story short: I'd go for signed, not for performance reasons, but because the semantics are generally way less surprising/hostile in most common scenarios.

Answered Jan 27 '23 by Matteo Italia