
Subclass str and add a new method with the same effect as +=

I'm trying to subclass str - not for anything important, just an experiment to learn more about Python built-in types. I've subclassed str this way (using __new__ because str is immutable):

class MyString(str):
    def __new__(cls, value=''):
        return str.__new__(cls, value)
    def __radd__(self, value):  # what method should I use??
        return MyString(self + value)  # what goes here??
    def write(self, data):
        self.__radd__(data)

It initializes correctly, as far as I can tell, but I can't get it to modify itself in place using the += operator. I've tried overriding __add__, __radd__, __iadd__ and a variety of other configurations. Using a return statement, I've managed to get it to return a new instance of the correctly appended MyString, but not to modify itself in place. Success would look like:

b = MyString('g')
b.write('h')  # b should now be 'gh'

Any thoughts?
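
For reference, here is a minimal sketch of why no override can make write() change b in place. str objects are immutable, so += on a str (or a subclass) builds a new object and rebinds the name on the left-hand side. The __add__ override below is only an assumption about how one might keep the MyString type after concatenation; it is not part of the original attempt.

```python
class MyString(str):
    def __new__(cls, value=''):
        return str.__new__(cls, value)

    def __add__(self, other):
        # Concatenation always creates a new object; wrap it so the
        # result is still a MyString rather than a plain str.
        return MyString(str.__add__(self, other))


b = MyString('g')
alias = b          # second name for the same object
b += 'h'           # str has no __iadd__, so this is b = b.__add__('h')

print(b, alias, b is alias)  # gh g False
```

The name b now points at a new object, while alias still sees the old 'g'. A method such as write() only has access to self, not to the caller's variable, so it cannot perform that rebinding.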

UPDATE

To possibly add a reason why someone might want to do this, I followed the suggestion of creating the following mutable class that uses a plain string internally:

class StringInside(object):

    def __init__(self, data=''):
        self.data = data

    def write(self, data):
        self.data += data

    def read(self):
        return self.data

and tested with timeit:

>>> import timeit
>>> timeit.timeit("arr+='1234567890'", setup="arr = ''", number=10000)
0.004415035247802734
>>> timeit.timeit("arr.write('1234567890')", setup="from hard import StringInside; arr = StringInside()", number=10000)
0.0331270694732666

The difference increases rapidly as the number goes up: at 1 million iterations, StringInside took longer than I was willing to wait, while the pure str version returned in ~100 ms.

UPDATE 2

For posterity, I decided to write a Cython class wrapping a C++ string to see if performance could be improved compared to one loosely based on Mike Müller's updated version below, and I managed to succeed. I realize Cython is "cheating", but I provide this just for fun.

python version:

class Mike(object):

    def __init__(self, data=''):
        self._data = []
        self._data.extend(data)

    def write(self, data):
        self._data.extend(data)

    def read(self, stop=None):
        return ''.join(self._data[0:stop])

    def pop(self, stop=None):
        if not stop:
            stop = len(self._data)
        try:
            return ''.join(self._data[0:stop])
        finally:
            self._data = self._data[stop:]

    def __getitem__(self, key):
        return ''.join(self._data[key])

cython version:

from libcpp.string cimport string

cdef class CyBuff:
    cdef string buff
    cdef public int length

    def __cinit__(self, string data=''):
        self.length = len(data)
        self.buff = data

    def write(self, string new_data):
        self.length += len(new_data)
        self.buff += new_data

    def read(self, int length=0):
        if not length:
            length = self.length
        return self.buff.substr(0, length)

    def pop(self, int length=0):
        if not length:
            length = self.length
        ans = self.buff.substr(0, length)
        self.buff.erase(0, length)
        return ans

performance:

writing

>>> timeit.timeit("arr.write('1234567890')", setup="from pyversion import Mike; arr = Mike()", number=1000000)
0.5992741584777832
>>> timeit.timeit("arr.write('1234567890')", setup="from cyversion import CyBuff; arr = CyBuff()", number=1000000)
0.17381906509399414

reading

>>> timeit.timeit("arr.write('1234567890'); arr.read(5)", setup="from pyversion import Mike; arr = Mike()", number=1000000)
1.1499049663543701
>>> timeit.timeit("arr.write('1234567890'); arr.read(5)", setup="from cyversion import CyBuff; arr = CyBuff()", number=1000000)
0.2894480228424072

popping

>>> # note I'm using 10e3 iterations - the python version wouldn't return otherwise
>>> timeit.timeit("arr.write('1234567890'); arr.pop(5)", setup="from pyversion import Mike; arr = Mike()", number=10000)
0.7390561103820801
>>> timeit.timeit("arr.write('1234567890'); arr.pop(5)", setup="from cyversion import CyBuff; arr = CyBuff()", number=10000)
0.01501607894897461

asked Jan 16 '16 at 21:01 by domoarigato

1 Answer

Solution

This is an answer to the updated question.

You can use a list to hold data and only construct the string when reading it:

class StringInside(object):

    def __init__(self, data=''):
        self._data = []
        self._data.append(data)

    def write(self, data):
        self._data.append(data)

    def read(self):
        return ''.join(self._data)
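
A short usage sketch of the class above (the class is repeated so the snippet runs on its own; the chunk values are made up for illustration). Each write() only appends a reference to the list, and the single join happens in read():

```python
class StringInside(object):

    def __init__(self, data=''):
        self._data = []
        self._data.append(data)

    def write(self, data):
        # Cheap: stores the chunk, no string copying yet.
        self._data.append(data)

    def read(self):
        # The only place where a new string is built.
        return ''.join(self._data)


arr = StringInside('ab')
for chunk in ('cd', 'ef'):
    arr.write(chunk)

print(arr.read())      # abcdef
print(len(arr._data))  # 3
```

Note that read() joins all chunks again on every call, so this layout pays off when writes are much more frequent than reads.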

Performance

The performance of this class:

%%timeit arr = StringInside()
arr.write('1234567890')
1000000 loops, best of 3: 352 ns per loop

is much closer to that of the native str:

%%timeit str_arr = ''
str_arr+='1234567890'
1000000 loops, best of 3: 222 ns per loop

Compare with your version:

%%timeit arr = StringInsidePlusEqual()
arr.write('1234567890')
100000 loops, best of 3: 87 µs per loop

Reason

Building a string with my_string += another_string has long been a performance anti-pattern, because each concatenation may copy the whole string. CPython has some optimizations for this case, but it seems that CPython cannot detect that the pattern is used here. This is likely because it is a bit hidden inside a class: the += targets an attribute rather than a plain variable.

Not all implementations have this optimization, for various reasons. For example, PyPy, which in general is much faster than CPython, is considerably slower for this use case:

PyPy 2.6.0 (Python 2.7.9)

>>>> import timeit
>>>> timeit.timeit("arr+='1234567890'", setup="arr = ''", number=10000)
0.08312582969665527

CPython 2.7.11

>>> import timeit
>>> timeit.timeit("arr+='1234567890'", setup="arr = ''", number=10000)
0.002151966094970703
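
To check the effect described under Reason on your own interpreter, a rough comparison like the following can help. It is only a sketch: the helper names local_concat and attribute_concat are made up here, and the timings depend heavily on the Python implementation and version, so no expected numbers are shown. As far as I know, CPython's shortcut needs the left-hand string to have no other references, which can hold for a plain variable but not for a string that is also stored in an instance attribute.

```python
import timeit


class StringInsidePlusEqual(object):
    """The += based version from the question."""

    def __init__(self, data=''):
        self.data = data

    def write(self, data):
        # The instance attribute keeps its own reference to the string,
        # so the interpreter cannot simply grow it in place.
        self.data += data


def local_concat(n):
    # += on a plain local variable: eligible for CPython's shortcut.
    s = ''
    for _ in range(n):
        s += '1234567890'
    return s


def attribute_concat(n):
    # += on an attribute, hidden behind a method call.
    obj = StringInsidePlusEqual()
    for _ in range(n):
        obj.write('1234567890')
    return obj.data


n = 10000
t_local = timeit.timeit(lambda: local_concat(n), number=5)
t_attr = timeit.timeit(lambda: attribute_concat(n), number=5)
print('local variable: %.4f s' % t_local)
print('attribute:      %.4f s' % t_attr)
```

Both functions produce the same string; only the way the intermediate results are stored differs.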

Slice-able version

This version supports slicing:

class StringInside(object):

    def __init__(self, data=''):
        self._data = []
        self._data.extend(data)

    def write(self, data):
        self._data.extend(data)

    def read(self, start=None, stop=None):
        return ''.join(self._data[start:stop])

    def __getitem__(self, key):
        return ''.join(self._data[key])

You can slice the normal way:

>>> arr = StringInside('abcdefg')
>>> arr[2]
'c'
>>> arr[1:3]
'bc'

Now, read() also supports optional start and stop indices:

>>> arr.read()
'abcdefg'
>>> arr.read(1, 3)
'bc'

>>> arr.read(1)
'bcdefg'
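
Regarding the pop() timings in the question: the pure-Python pop() there rebuilds the list with self._data[stop:] on every call, which copies everything that remains in the buffer. One possible alternative, not part of the versions above, is collections.deque, whose popleft() removes from the front in constant time. The class name DequeString is made up for this sketch, and the interface is only assumed to mirror write/read/pop from the question:

```python
from collections import deque
from itertools import islice


class DequeString(object):

    def __init__(self, data=''):
        # One deque entry per character, like the extend() based version.
        self._data = deque(data)

    def write(self, data):
        self._data.extend(data)

    def read(self, stop=None):
        # islice avoids copying the whole deque; stop=None reads everything.
        return ''.join(islice(self._data, stop))

    def pop(self, stop=None):
        # Remove up to `stop` characters from the front and return them.
        if not stop:
            count = len(self._data)
        else:
            count = min(stop, len(self._data))
        return ''.join([self._data.popleft() for _ in range(count)])


buf = DequeString('abc')
buf.write('defg')
first_three = buf.read(3)   # 'abc', buffer unchanged
popped = buf.pop(2)         # 'ab', removed from the buffer
rest = buf.read()           # 'cdefg'
print(first_three, popped, rest)
```

The trade-off is that a deque does not support slice indexing, so the arr[1:3] style access of the slice-able version would need extra work.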

answered Oct 13 '22 at 14:10 by Mike Müller