Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Would this optimization in the implementation of std::string be allowed?

I was just thinking about the implementation of std::string::substr. It returns a new std::string object, which seems a bit wasteful to me. Why not return an object that refers to the contents of the original string and can be implicitly assigned to a std::string? A kind of lazy evaluation of the actual copying. Such a class could look something like this:

template <class Ch, class Tr, class A>
class string_ref {
public:
    // not important yet, but *looks* like basic_string's for the most part

private:
    const basic_string<Ch, Tr, A> &s_;
    const size_type pos_;
    const size_type len_;    
};

The public interface of this class would mimic all of the read-only operations of a real std::string, so the usage would be seamless. std::string could then have a new constructor which takes a string_ref so the user would never be the wiser. The moment you try to "store" the result, you end up creating a copy, so no real issues with the reference pointing to data and then having it modified behind its back.

The idea being that code like this:

std::string s1 = "hello world";
std::string s2 = "world";
if(s1.substr(6) == s2) {
    std::cout << "match!" << std::endl;
}

would have no more than 2 std::string objects constructed in total. This seems like a useful optimization for code which that performs a lot of string manipulations. Of course, this doesn't just apply to std::string, but to any type which can return a subset of its contents.

As far as I know, no implementations do this.

I suppose the core of the question is:

Given a class that can be implicitly converted to a std::string as needed, would it be conforming to the standard for a library writer to change the prototype of a member's to return type? Or more generally, do the library writers have the leeway to return "proxy objects" instead of regular objects in these types of cases as an optimization?

My gut is that this is not allowed and that the prototypes must match exactly. Given that you cannot overload on return type alone, that would leave no room for library writers to take advantage of these types of situations. Like I said, I think the answer is no, but I figured I'd ask :-).

like image 540
Evan Teran Avatar asked Jan 13 '11 16:01

Evan Teran


2 Answers

This idea is copy-on-write, but instead of COW'ing the entire buffer, you keep track of which subset of the buffer is the "real" string. (COW, in its normal form, was (is?) used in some library implementations.)

So you don't need a proxy object or change of interface at all because these details can be made completely internal. Conceptually, you need to keep track of four things: a source buffer, a reference count for the buffer, and the start and end of the string within this buffer.

Anytime an operation modifies the buffer at all, it creates its own copy (from the start and end delimiters), decreases the old buffer's reference count by one, and sets the new buffer's reference count to one. The rest of the reference counting rules are the same: copy and increase count by one, destruct a string and decrease count by one, reach zero and delete, etc.

substr just makes a new string instance, except with the start and end delimiters explicitly specified.

like image 186
GManNickG Avatar answered Sep 27 '22 21:09

GManNickG


This is a quite well-known optimization that is relatively widely used, called copy-on-write or COW. The basic thing is not even to do with substrings, but with something as simple as

s1 = s2;

Now, the problem with this optimization is that for C++ libraries that are supposed to be used on targets supporting multiple threads, the reference count for the string has to be accessed using atomic operations (or worse, protected with a mutex in case the target platform doesn't supply atomic operations). This is expensive enough that in most cases the simple non-COW string implementation is faster.

See GOTW #43-45:

http://www.gotw.ca/gotw/043.htm

http://www.gotw.ca/gotw/044.htm

http://www.gotw.ca/gotw/045.htm

To make matters worse, libraries that have used COW, such as the GNU C++ library, cannot simply revert to the simple implementation since that would break the ABI. (Although, C++0x to the rescue, as that will require an ABI bump anyway! :) )

like image 40
janneb Avatar answered Sep 27 '22 19:09

janneb