Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does FBString's small string optimization rely on undefined behavior?

Facebook's fbstring_core class uses the "Small String Optimization" described in this talk wherein the storage for the class' data members -- a Char*, size and capacity -- will be repurposed to store character data if the string is sufficiently small. The flag bits used to distinguish between these cases are located in the "rightmost char of the storage". My question is whether accessing these bits through the bytes_ union member, which is never actually written, constitutes undefined behavior per the C++11 standard? The answer to Accessing inactive union member and undefined behavior? suggests it is.

The following excerpt contains the declaration of these members and the category() member function that is used to determine whether this optimization is in effect.

    typedef uint8_t category_type;

    enum class Category : category_type {
      isSmall = 0,
      isMedium = kIsLittleEndian ? 0x80 : 0x2,
      isLarge = kIsLittleEndian ? 0x40 : 0x1,
    };

    Category category() const {
      // works for both big-endian and little-endian
      return static_cast<Category>(bytes_[lastChar] & categoryExtractMask);
    }

    struct MediumLarge {
      Char * data_;
      size_t size_;
      size_t capacity_;

      size_t capacity() const {
        return kIsLittleEndian
          ? capacity_ & capacityExtractMask
          : capacity_ >> 2;
      }

      void setCapacity(size_t cap, Category cat) {
        capacity_ = kIsLittleEndian
            ? cap | (static_cast<size_t>(cat) << kCategoryShift)
            : (cap << 2) | static_cast<size_t>(cat);
      }
    };

    union {
      uint8_t bytes_[sizeof(MediumLarge)]; // For accessing the last byte.
      Char small_[sizeof(MediumLarge) / sizeof(Char)];
      MediumLarge ml_;
    };

It seems that this implementation relies on using "type punning" to access a byte that might actually be part of the size_t capacity_ member. From the answer to the question linked above, I gather that this is defined behavior in C99, but not in C++11?

like image 330
drewbarbs Avatar asked Aug 26 '17 22:08

drewbarbs


1 Answers

Not only does that appear to be UB, it is quite unnecessary, because the only use of bytes_ appears to be for reading the last byte of this, which can be done without UB:

reinterpret_cast<const char*>(this)[sizeof(*this) - 1]

That's thanks to the special exemption in C++ which allows reinterpreting objects as char arrays.

like image 161
John Zwinck Avatar answered Oct 21 '22 18:10

John Zwinck