Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the equivalent of CPython string concatenation, in C++? [duplicate]

Tags:

c++

python

Possible Duplicate:
Simple string concatenation

Yesterday, as I'm writing this, someone asked on SO

if i have a string x='wow' applying the function add in python :

x='wow'
x.add(x)
'wowwow'

how can i do that in C++?

With add (which is non-existent) corrected to __add__ (a standard method) this is a deep and interesting question, involving both subtle low level details, high level algorithm complexity considerations, and even threading!, and yet it’s formulated in a very short and concise way.

I am reposting the original question as my own because I did not get a chance to provide a correct answer before it was deleted, and my efforts at having the original question revived, so that I could help to increase the general understanding of these issues, failed.

I have changed the original title “select python or C++” to …

  • What is the equivalent of CPython string concatenation, in C++?

thereby narrowing the question a little.

like image 958
Cheers and hth. - Alf Avatar asked Oct 23 '12 00:10

Cheers and hth. - Alf


1 Answers

The general meaning of the code snippet.

The given code snippet

x = 'wow'
x.__add__( x )

has different meanings in Python 2.x and Python 3.x.

In Python 2.x the strings are by default narrow strings, with one byte per encoding unit, corresponding to C++ char based strings.

In Python 3.x the strings are wide strings, guaranteed to represent Unicode, corresponding to the in practice usage of C++ wchar_t based strings, and likewise with an unspecified 2 or 4 bytes per encoding unit.

Disregarding efficiency the __add__ method behaves the same in both main Python versions, corresponding to the C++ + operator for std::basic_string (i.e., for std::string and std::wstring), e.g. quoting the CPython 3k documentation:

object.__add__(self, other)
… to evaluate the expression x + y, where x is an instance of a class that has an __add__() method, x.__add__(y) is called.

So as an example, the CPython 2.7 code

x = 'wow'
y = x.__add__( x )
print y

would normally be written as

x = 'wow'
y = x + x
print y

and corresponds to this C++ code:

#include <iostream>
#include <string>
using namespace std;

int main()
{
    auto const x = string( "wow" );
    auto const y = x + x;
    cout << y << endl;
}

A main difference from the many incorrect answers given for the original question, is that the C++ correspondence is an expression, and not an update.

It is perhaps natural to think that the method name __add__ signifies a change of the string object’s value, an update, but with respect to observable behavior Python strings are immutable strings. Their values never change, as far as can be directly observed in Python code. This is the same as in Java and C#, but very unlike C++’s mutable std::basic_string strings.

The quadratic to linear time optimization in CPython.

CPython 2.4 added the following optimization, for narrow strings only:

String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances. This optimization won't be present in other Python implementations such as Jython, so you shouldn't rely on it; using the join() method of strings is still recommended when you want to efficiently glue a large number of strings together. (Contributed by Armin Rigo.)

It may not sound like much, but where it’s applicable this optimization reduces a sequence of concatenations from quadratic time O(n2) to linear time O(n), in the length n of the final result.

First of all the optimization replaces concatenations with updates, e.g. as if

x = x + a
x = x + b
x = x + c

or for that matter

x = x + a + b + c

was replaced with

x += a
x += b
x += c

In the general case there will be many references to the string object that x refers to, and since Python string objects must appear to be immutable, the first update assignment cannot change that string object. It therefore, generally, has to create a completely new string object, and assign that (reference) to x.

At this point x holds the only reference to that object. Which means that the object can be updated by the update assignment that appends b, because there are no observers. And likewise for the append of c.

It’s a bit like quantum mechanics: you cannot observe this dirty thing going on, and it’s never done when there is the possibility of anyone observing the machinations, but you can infer that it must have been going on by the statistics that you collect about performance, for linear time is quite different from quadratic time!

How is linear time achieved? Well, with the update the same strategy of buffer doubling as in C++ std::basic_string can be done, which means that the existing buffer contents only need to be copied at each buffer reallocation, and not for each append operation. Which means that the total cost of copying is at worst linear in the final string size, in the same way as the sum (representing the costs of copying at each buffer doubling) 1 + 2 + 4 + 8 + … + N is less than 2*N.

Linear time string concatenation expressions in C++.

In order to faithfully reproduce the CPython code snippet in C++,

  • the final result and expression-nature of the operation be should captured,

  • and also its performance characteristics should be captured!

A direct translation of CPython __add__ to C++ std::basic_string + fails to reliably capture the CPython linear time. The C++ + string concatenation may be optimized by the compiler in the same way as the CPython optimization. Or not – which then means that one has told a beginner that the C++ equivalent of a Python linear time operation, is something with quadratic time – hey, this is what you should use…

For the performance characteristics C++ += is the basic answer, but, this does not catch the expression nature of the Python code.

The natural answer is a linear time C++ string builder class that translates a concatenation expression to a series of += updates, so that the Python code

from __future__ import print_function

def foo( s ):
    print( s )

a = 'alpha'
b = 'beta'
c = 'charlie'
foo( a + b + c )    # Expr-like linear time string building.

correspond to roughly

#include <string>
#include <sstream>

namespace my {
    using std::string;
    using std::ostringstream;

    template< class Type >
    string stringFrom( Type const& v )
    {
        ostringstream stream;
        stream << v;
        return stream.str();
    }

    class StringBuilder
    {
    private:
        string      s_;

        template< class Type >
        static string fastStringFrom( Type const& v )
        {
            return stringFrom( v );
        }

        static string const& fastStringFrom( string const& s )
        { return s; }

        static char const* fastStringFrom( char const* const s )
        { return s; }

    public:
        template< class Type >
        StringBuilder& operator<<( Type const& v )
        {
            s_ += fastStringFrom( v );
            return *this;
        }

        string const& str() const { return s_; }
        char const* cStr() const { return s_.c_str(); }

        operator string const& () const { return str(); }
        operator char const* () const { return cStr(); }
    };
}  // namespace my

#include <iostream>
using namespace std;
typedef my::StringBuilder S;

void foo( string const& s )
{
    cout << s << endl;
}

int main()
{
    string const    a   = "alpha";
    string const    b   = "beta";
    string const    c   = "charlie";

    foo( S() << a << b << c );      // Expr-like linear time string building.
}
like image 188
Cheers and hth. - Alf Avatar answered Sep 18 '22 10:09

Cheers and hth. - Alf