I have encountered a strange performance problem when adding to an str class member in python 2.7.3. I know that accessing local variables are faster, however, in the problem below, there is more than 100x speed difference between the two loops. The one that accesses a.accum_ starts fast but slows down, as if str iadd is O(n^2) with the length of str.
Does anyone know the reason?
# Fast ( < 1sec):
accum = str()
for ii in range(1000000):
if (ii % 10000) == 0:
print 'fast cnt', ii
accum += 'zzzzz\n'
# Much slower ( > 5 mins):
class Foo:
pass
a = Foo()
a.accum_ = str()
for ii in range(1000000):
if (ii % 10000) == 0:
print 'slow cnt', ii
a.accum_ += 'zzzzz\n'
For the first example it is pretty clear that it is case of single reference optimisation(there are two references actually: one from the object itself and one LOAD_FAST
; unicode_concatenate
will try to reduce it to 1 before passing control to PyUnicode_Append
) done by CPython using this unicode_modifiable
function:
static int
unicode_modifiable(PyObject *unicode)
{
assert(_PyUnicode_CHECK(unicode));
if (Py_REFCNT(unicode) != 1)
return 0;
if (_PyUnicode_HASH(unicode) != -1)
return 0;
if (PyUnicode_CHECK_INTERNED(unicode))
return 0;
if (!PyUnicode_CheckExact(unicode))
return 0;
#ifdef Py_DEBUG
/* singleton refcount is greater than 1 */
assert(!unicode_is_singleton(unicode));
#endif
return 1;
}
But in the second case as the instance data is stored in a Python dict
rather than a simple variable, so the things get little different.
a.accum_ += 'foo'
actually requires prefetching the value of a.accum_
and storing it on to the stack. So, now the string has atleast three references: one from the instance dictionary, one from DUP_TOP
and one from PyObject_GetAttr
used by LOAD_ATTR
. Hence, Python cannot optimise this case as modifying one of them in-place will affect other references as well.
>>> class A:
pass
...
>>> a = A()
>>> def func():
a.str = 'spam'
print a.str
return '_from_func'
...
>>> a.str = 'foo'
>>> a.str += func()
spam
You would expect the output here to be 'spam_from_func'
, but it is going to be different because the original value of a.str
was stored by Python before func()
was called.
>>> a.str
'foo_from_func'
Byte Code:
>>> import dis
>>> def func_class():
a = Foo()
a.accum = ''
a.accum += 'zzzzz\n'
...
>>> dis.dis(func_class)
2 0 LOAD_GLOBAL 0 (Foo)
3 CALL_FUNCTION 0 (0 positional, 0 keyword pair)
6 STORE_FAST 0 (a)
3 9 LOAD_CONST 1 ('')
12 LOAD_FAST 0 (a)
15 STORE_ATTR 1 (accum)
4 18 LOAD_FAST 0 (a)
21 DUP_TOP
22 LOAD_ATTR 1 (accum)
25 LOAD_CONST 2 ('zzzzz\n')
28 INPLACE_ADD
29 ROT_TWO
30 STORE_ATTR 1 (accum)
33 LOAD_CONST 0 (None)
36 RETURN_VALUE
Note that this optimisation was done in around 2004(CPython 2.4) to prevent users from
slowness of a += b
or a = a + b
, so it is mostly meant for simple variables and works only if the next instruction is STORE_FAST
(local variable), STORE_DEREF
(closures) and STORE_NAME
. It is not a general solution, the best way to do this in Python is to create a list and join its items using str.join
.
CPython implementation detail: If
s
andt
are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the forms = s + t
ors += t
. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use thestr.join()
method which assures consistent linear concatenation performance across versions and implementations.
Python strings are immutable and so can't have an __iadd__
method. What you are witnessing in the first example is a micro optimisation of the CPython interpreter. In the first example the interpreter has noticed that it has a local variable that has a reference count of 1. Thus, the interpreter can cheekily get away with modifying the string in place. Even though this violates the contract of str
, at no point during the execution of the program will it be apparent that this contract was briefly violated.
In the latter example this micro optimisation isn't being implemented, which is why it's so slow. It looks like the optimisation could be applied, so I'm not sure why it isn't being applied.
Generally though, if building a string, collate the substrings in a list and then use str.join
to create the end product.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With