Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Possibility of COW std::string implementation in C++11

Today I passed by this SO question: Legality of COW std::string implementation in C++11

The most voted answer (35 upvotes) for that question says:

It's not allowed, because as per the standard 21.4.1 p6, invalidation of iterators/references is only allowed for

— as an argument to any standard library function taking a reference to non-const basic_string as an argument.

— Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.

I wonder whether that justification is valid or not because it seems C++03 has similar requirements for string iterator invalidation:

References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:

  • As an argument to non-member functions swap() (21.3.7.8), operator>>() (21.3.7.9), and getline() (21.3.7.9).
  • As an argument to basic_string::swap().
  • Calling data() and c_str() member functions.
  • Calling non-const member functions, except operator[](), at(), begin(), rbegin(), end(), and rend().
  • Subsequent to any of the above uses except the forms of insert() and erase() which return iterators, the first call to non-const member functions operator[](), at(), begin(), rbegin(), end(), or rend().

These are not exactly the same as those of C++11's, but at least the same for the part of operator[](), which the original answer took as the major justification. So I guess, in order to justify the illegality of COW std::string implementation in C++11, some other standard requirements need to be cited. Help needed here.


That SO question has been inactive for over a year, so I've decided to raise this as a separate question. Please let me know if this is inappropriate and I will find some other way to clear my doubt.

like image 464
goodbyeera Avatar asked Mar 03 '14 13:03

goodbyeera


People also ask

Can you use std::string in C?

The std::string class manages the underlying storage for you, storing your strings in a contiguous manner. You can get access to this underlying buffer using the c_str() member function, which will return a pointer to null-terminated char array. This allows std::string to interoperate with C-string APIs.

How are C++ strings implemented?

The c++ solution for strings are quite different from the c-version. The first and most important difference is while the c using the ASCIIZ solution, the std::string and std::wstring are using two iterators (pointers) to store the actual string.

What is std::string in C++?

C++ has in its definition a way to represent a sequence of characters as an object of the class. This class is called std:: string. String class stores the characters as a sequence of bytes with the functionality of allowing access to the single-byte character.

Is std::string the same as string?

There is no functionality difference between string and std::string because they're the same type.


1 Answers

The key point is the last point in the C++03 standard. The wording could be a lot clearer, but the intent is that the first call to [], at, etc. (but only the first call) after something which established new iterators (and thus invalidated old ones) could invalidate iterators, but only the first. The wording in C++03 was, in fact, a quick hack, inserted in response to comments by the French national body on the CD2 of C++98. The original problem is simple: consider:

std::string a( "some text" );
std::string b( a );
char& rc = a[2];

At this point, modifications through rc must affect a, but not b. If COW is being used, however, when a[2] is called, a and b share a representation; in order for writes through the returned reference not to affect b, a[2] must be considered a "write", and be allowed to invalidate the reference. Which is what CD2 said: any call to a non-const [], at, or one of the begin or end functions could invalidate iterators and references. The French national body comments pointed out that this rendered a[i] == a[j] invalid, since the reference returned by one of the [] would be invalidated by the other. The last point you cite of C++03 was added to circumvent this—only the first call to [] et al. could invalidate the iterators.

I don't think anyone was totally happy with the results. The wording was done quickly, and while the intent was clear to those who were aware of the history, and the original problem, I don't think it was fully clear from standard. In addition, some experts began to question the value of COW to begin with, given the relative impossibility of the string class itself to reliably detect all writes. (If a[i] == a[j] is the complete expression, there is no write. But the string class itself must assume that the return value of a[i] may result in a write.) And in a multi-threaded environment, the cost of managing the reference count needed for copy on write was deemed a relatively high cost for something you usually don't need. The result is that most implementations (which supported threading long before C++11) have been moving away from COW anyway; as far as I know, the only major implementation still using COW was g++ (but there was a known bug in their multithreaded implementation) and (maybe) Sun CC (which the last time I looked at it, was inordinately slow, because of the cost of managing the counter). I think the committee simply took what seemed to them the simplest way of cleaning things up, by forbidding COW.

EDIT:

Some more clarification with regards to why a COW implementation has to invalidate iterators on the first call to []. Consider a naïve implementation of COW. (I will just call it String, and ignore all of the issues involving traits and allocators, which aren't really relevant here. I'll also ignore exception and thread safety, just to make things as simple as possible.)

class String
{
    struct StringRep
    {
        int useCount;
        size_t size;
        char* data;
        StringRep( char const* text, size_t size )
            : useCount( 1 )
            , size( size )
            , data( ::operator new( size + 1 ) )
        {
            std::memcpy( data, text, size ):
            data[size] = '\0';
        }
        ~StringRep()
        {
            ::operator delete( data );
        }
    };

    StringRep* myRep;
public:
    String( char const* initial_text )
        : myRep( new StringRep( initial_text, strlen( initial_text ) ) )
    {
    }
    String( String const& other )
        : myRep( other.myRep )
    {
        ++ myRep->useCount;
    }
    ~String()
    {
        -- myRep->useCount;
        if ( myRep->useCount == 0 ) {
            delete myRep;
        }
    }
    char& operator[]( size_t index )
    {
        return myRep->data[index];
    }
};

Now imagine what happens if I write:

String a( "some text" );
String b( a );
a[4] = '-';

What is the value of b after this? (Run through the code by hand, if you're not sure.)

Obviously, this doesn't work. The solution is to add a flag, bool uncopyable; to StringRep, which is initialized to false, and to modify the following functions:

String::String( String const& other )
{
    if ( other.myRep->uncopyable ) {
        myRep = new StringRep( other.myRep->data, other.myRep->size );
    } else {
        myRep = other.myRep;
        ++ myRep->useCount;
    }
}

char& String::operator[]( size_t index )
{
    if ( myRep->useCount > 1 ) {
        -- myRep->useCount;
        myRep = new StringRep( myRep->data, myRep->size );
    }
    myRep->uncopyable = true;
    return myRep->data[index];
}

This means, of course, that [] will invalidate iterators and references, but only the first time it is called on an object. The next time, the useCount will be one (and the image will be uncopyable). So a[i] == a[j] works; regardless of which the compiler actually evaluates first (a[i] or a[j]), the second one will find a useCount of 1, and will not have to duplicate. And because of the uncopyable flag,

String a( "some text" );
char& c = a[4];
String b( a );
c = '-';

will also work, and not modify b.

Of course, the above is enormously simplified. Getting it to work in a multithreaded environment is extremely difficult, unless you simply grab a mutex for the entire function for any function which might modify anything (in which case, the resulting class is extremely slow). G++ tried, and failed—there is on particular use case where it breaks. (Getting it to handle the other issues I've ignored is not particularly difficult, but does represent a lot of lines of code.)

like image 185
James Kanze Avatar answered Sep 27 '22 18:09

James Kanze