Why is it slower to iterate over a small string than a small list?

I was playing around with timeit and noticed that a simple list comprehension over a small string took longer than the same operation on a list of small single-character strings. Any explanation? It takes almost 1.35 times as long.

>>> from timeit import timeit
>>> timeit("[x for x in 'abc']")
2.0691067844831528
>>> timeit("[x for x in ['a', 'b', 'c']]")
1.5286479570345861

What's happening on a lower level that's causing this?

asked May 26 '14 by Sunjay Varma




2 Answers

TL;DR

  • For Python 2, the actual speed difference is closer to 70% (or more) once much of the overhead is removed.

  • Object creation is not at fault: neither method creates a new object, because one-character strings are cached (see the quick check just after this list).

  • The difference is not obvious, but most likely comes from a greater number of checks during string indexing, covering both type and well-formedness, plus the need to decide what to return.

  • List indexing is remarkably fast.
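
A quick check of the caching claim (this relies on a CPython implementation detail: one-character latin-1 strings are kept in a cache, so indexing hands back the same object every time):

>>> s = "abc"
>>> s[0] is s[0]   # the same cached object both times
True
>>> s[0] is "a"    # the literal resolves to that cached object too
True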



>>> python3 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.388 usec per loop

>>> python3 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.436 usec per loop

This disagrees with what you've found...

You must be using Python 2, then.

>>> python2 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.212 usec per loop

Let's explain the difference between the versions. I'll examine the compiled code.

For Python 3:

import dis

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   4           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b118a0, file "", line 4>)
#>>>               3 LOAD_CONST               2 ('list_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('a')
#>>>              12 LOAD_CONST               4 ('b')
#>>>              15 LOAD_CONST               5 ('c')
#>>>              18 BUILD_LIST               3
#>>>              21 GET_ITER
#>>>              22 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              25 POP_TOP
#>>>              26 LOAD_CONST               0 (None)
#>>>              29 RETURN_VALUE

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>  21           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b17150, file "", line 21>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('abc')
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

You see here that the list variant is likely to be slower due to the building of the list each time.

This is the

 9 LOAD_CONST   3 ('a')
12 LOAD_CONST   4 ('b')
15 LOAD_CONST   5 ('c')
18 BUILD_LIST   3

part. The string variant only has

 9 LOAD_CONST   3 ('abc') 

You can check that this does seem to make a difference:

def string_iterate():
    [item for item in ("a", "b", "c")]

dis.dis(string_iterate)
#>>>  35           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d068be660, file "", line 35>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               6 (('a', 'b', 'c'))
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

This produces just

 9 LOAD_CONST               6 (('a', 'b', 'c')) 

as tuples are immutable. Test:

>>> python3 -m timeit '[x for x in ("a", "b", "c")]'
1000000 loops, best of 3: 0.369 usec per loop

Great, back up to speed.
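
You can also see the constant folding directly. This is my own quick illustration, not from the benchmarks above: a tuple literal of constants is baked into the code object and reused on every evaluation, whereas a list literal is rebuilt each time.

>>> def with_tuple():
...     return ("a", "b", "c")
...
>>> with_tuple() is with_tuple()   # one folded constant, reused
True
>>> def with_list():
...     return ["a", "b", "c"]
...
>>> with_list() is with_list()     # a fresh list built on every call
False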

For Python 2:

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('a')
#>>>               6 LOAD_CONST               2 ('b')
#>>>               9 LOAD_CONST               3 ('c')
#>>>              12 BUILD_LIST               3
#>>>              15 GET_ITER
#>>>         >>   16 FOR_ITER                12 (to 31)
#>>>              19 STORE_FAST               0 (item)
#>>>              22 LOAD_FAST                0 (item)
#>>>              25 LIST_APPEND              2
#>>>              28 JUMP_ABSOLUTE           16
#>>>         >>   31 POP_TOP
#>>>              32 LOAD_CONST               0 (None)
#>>>              35 RETURN_VALUE

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('abc')
#>>>               6 GET_ITER
#>>>         >>    7 FOR_ITER                12 (to 22)
#>>>              10 STORE_FAST               0 (item)
#>>>              13 LOAD_FAST                0 (item)
#>>>              16 LIST_APPEND              2
#>>>              19 JUMP_ABSOLUTE            7
#>>>         >>   22 POP_TOP
#>>>              23 LOAD_CONST               0 (None)
#>>>              26 RETURN_VALUE

The odd thing is that we have the same building of the list here, yet it's still faster. Python 2 is acting strangely fast.

Let's remove the comprehensions and re-time. The _ = is to prevent it getting optimised out.

>>> python3 -m timeit '_ = ["a", "b", "c"]'
10000000 loops, best of 3: 0.0707 usec per loop

>>> python3 -m timeit '_ = "abc"'
100000000 loops, best of 3: 0.0171 usec per loop

We can see that initialization is not significant enough to account for the difference between the versions (those numbers are small)! We can thus conclude that Python 3 has slower comprehensions. This makes sense as Python 3 changed comprehensions to have safer scoping.
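
The scoping change is easy to demonstrate; here's a minimal sketch. Python 3 compiles each comprehension into its own function object (that's the MAKE_FUNCTION/CALL_FUNCTION pair in the bytecode above), so the loop variable no longer leaks:

>>> x = "outer"
>>> [x for x in "abc"]
['a', 'b', 'c']
>>> x   # Python 2 would print 'c' here, because the loop variable leaked
'outer'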

Well, now let's improve the benchmark (I'm just removing overhead that isn't iteration). Pre-assigning the iterable removes the cost of building it:

>>> python3 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.387 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
1000000 loops, best of 3: 0.368 usec per loop

>>> python2 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
10000000 loops, best of 3: 0.164 usec per loop

We can check whether calling iter is the overhead:

>>> python3 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.099 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.1 usec per loop

>>> python2 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.0913 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.0854 usec per loop

No. No it is not. The difference is too small, especially for Python 3.

So let's remove yet more unwanted overhead... by making the whole thing slower! The aim is just to have a longer iteration so the time hides overhead.

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 3.12 msec per loop

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.77 msec per loop

>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 2.32 msec per loop

>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.09 msec per loop

This hasn't actually changed much, but it's helped a little.

So remove the comprehension. It's overhead that's not part of the question:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.71 msec per loop

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 1.36 msec per loop

>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.27 msec per loop

>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 935 usec per loop

That's more like it! We can get slightly faster still by using deque to iterate. It's basically the same, but it's faster:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop

>>> python2 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 805 usec per loop

>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
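
As an aside, deque(iterable, maxlen=0) is the standard trick for exhausting an iterator at C speed: a zero-length deque discards every item it is fed. It's the heart of the consume recipe from the itertools docs, roughly (the full recipe also supports consuming just n items):

from collections import deque

def consume(iterator):
    # Feed the iterator into a zero-length deque: the loop runs in C
    # and every delivered item is immediately discarded.
    deque(iterator, maxlen=0)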

What impresses me is that Unicode is competitive with bytestrings. We can check this explicitly by trying bytes and unicode in both:

  • bytes

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)).encode("ascii") for _ in range(100000))' 'deque(iterable, maxlen=0)'    :(
    1000 loops, best of 3: 571 usec per loop

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)).encode("ascii") for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 757 usec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 438 usec per loop

    Here you see that Python 3 is actually faster than Python 2.

  • unicode

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = u"".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 800 usec per loop

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = u"".join(unichr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 1.07 msec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = [unichr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 469 usec per loop

    Again, Python 3 is faster, although this is to be expected (str has had a lot of attention in Python 3).

In fact, this unicode-bytes difference is very small, which is impressive.

So let's analyse this one case, seeing as it's fast and convenient for me:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop

We can actually rule out Tim Peters' 10-times-upvoted answer!

>>> foo = iterable[123]
>>> iterable[36] is foo
True

These are not new objects!

But indexing costs are worth mentioning. The difference will likely lie in the indexing, so let's remove the iteration and just index:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'iterable[123]'
10000000 loops, best of 3: 0.0397 usec per loop

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable[123]'
10000000 loops, best of 3: 0.0374 usec per loop

The difference seems small, but at least half of the cost is overhead:

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable; 123'
100000000 loops, best of 3: 0.0173 usec per loop

so the speed difference is sufficient to decide to blame it. I think.
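
To make that concrete, here's some back-of-the-envelope arithmetic on the numbers above (a rough sanity check, not a rigorous decomposition):

>>> str_index, list_index, harness = 0.0397, 0.0374, 0.0173  # usec per op
>>> round(str_index - harness, 4)   # net cost of one string index
0.0224
>>> round(list_index - harness, 4)  # net cost of one list index
0.0201
>>> round((str_index - list_index) * 100000)  # usec gap over 100,000 items
230

That ~230 usec is the same order as the gaps in the long-loop timings earlier, which is why indexing looks like a reasonable suspect.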

So why is indexing a list so much faster?

Well, I'll come back to you on that, but my guess is that it's down to the check for interned strings (or cached characters, if that's a separate mechanism). This will be less fast than optimal. But I'll go check the source (although I'm not comfortable in C...) :).


So here's the source:

static PyObject *
unicode_getitem(PyObject *self, Py_ssize_t index)
{
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 ch;
    PyObject *res;

    if (!PyUnicode_Check(self) || PyUnicode_READY(self) == -1) {
        PyErr_BadArgument();
        return NULL;
    }
    if (index < 0 || index >= PyUnicode_GET_LENGTH(self)) {
        PyErr_SetString(PyExc_IndexError, "string index out of range");
        return NULL;
    }
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    ch = PyUnicode_READ(kind, data, index);
    if (ch < 256)
        return get_latin1_char(ch);

    res = PyUnicode_New(1, ch);
    if (res == NULL)
        return NULL;
    kind = PyUnicode_KIND(res);
    data = PyUnicode_DATA(res);
    PyUnicode_WRITE(kind, data, 0, ch);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;
}

Walking from the top, we'll have some checks. These are boring. Then some assigns, which should also be boring. The first interesting line is

ch = PyUnicode_READ(kind, data, index); 

but we'd hope that is fast, as we're reading from a contiguous C array by indexing it. The result, ch, will be less than 256 so we'll return the cached character in get_latin1_char(ch).
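
You can poke at this branch from Python; this is my own sanity check on a CPython 3 build, not part of the original timings. Characters below 256 come back as the cached singletons, while anything above 255 takes the PyUnicode_New path and allocates a fresh one-character string on every index:

>>> s = "a\u1234"
>>> s[0] is s[0]   # ch < 256: the cached latin-1 singleton
True
>>> s[1] is s[1]   # ch >= 256: a new object each time
False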

So we'll run (dropping the first checks)

kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
return get_latin1_char(ch);

Where

#define PyUnicode_KIND(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)), \
     ((PyASCIIObject *)(op))->state.kind)

(which is boring because the asserts get compiled out in non-debug builds [so I can count them as fast] and ((PyASCIIObject *)(op))->state.kind is (I think) an indirection and a C-level cast);

#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) : \
     _PyUnicode_NONCOMPACT_DATA(op))

(which is also boring for similar reasons, assuming the macros (Something_CAPITALIZED) are all fast),

#define PyUnicode_READ(kind, data, index) \
    ((Py_UCS4) \
    ((kind) == PyUnicode_1BYTE_KIND ? \
        ((const Py_UCS1 *)(data))[(index)] : \
        ((kind) == PyUnicode_2BYTE_KIND ? \
            ((const Py_UCS2 *)(data))[(index)] : \
            ((const Py_UCS4 *)(data))[(index)] \
        ) \
    ))

(which involves indexes but really isn't slow at all) and

static PyObject*
get_latin1_char(unsigned char ch)
{
    PyObject *unicode = unicode_latin1[ch];
    if (!unicode) {
        unicode = PyUnicode_New(1, ch);
        if (!unicode)
            return NULL;
        PyUnicode_1BYTE_DATA(unicode)[0] = ch;
        assert(_PyUnicode_CheckConsistency(unicode, 1));
        unicode_latin1[ch] = unicode;
    }
    Py_INCREF(unicode);
    return unicode;
}

Which confirms my suspicion that:

  • This is cached:

    PyObject *unicode = unicode_latin1[ch]; 
  • This should be fast. The if (!unicode) is not run, so it's literally equivalent in this case to

    PyObject *unicode = unicode_latin1[ch];
    Py_INCREF(unicode);
    return unicode;

Honestly, after testing that the asserts are fast (by disabling them [I think that works on the C-level asserts...]), the only plausibly-slow parts are:

PyUnicode_IS_COMPACT(op)
_PyUnicode_COMPACT_DATA(op)
_PyUnicode_NONCOMPACT_DATA(op)

Which are:

#define PyUnicode_IS_COMPACT(op) \
    (((PyASCIIObject*)(op))->state.compact)

(fast, as before),

#define _PyUnicode_COMPACT_DATA(op) \
    (PyUnicode_IS_ASCII(op) ? \
     ((void*)((PyASCIIObject*)(op) + 1)) : \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

(fast if the macro IS_ASCII is fast), and

#define _PyUnicode_NONCOMPACT_DATA(op) \
    (assert(((PyUnicodeObject*)(op))->data.any), \
     ((((PyUnicodeObject *)(op))->data.any)))

(also fast as it's an assert plus an indirection plus a cast).

So we're down (the rabbit hole) to:

PyUnicode_IS_ASCII 

which is

#define PyUnicode_IS_ASCII(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)), \
     ((PyASCIIObject*)op)->state.ascii)

Hmm... that seems fast too...


Well, OK, but let's compare it to PyList_GetItem. (Yeah, thanks Tim Peters for giving me more work to do :P.)

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        if (indexerr == NULL) {
            indexerr = PyUnicode_FromString(
                "list index out of range");
            if (indexerr == NULL)
                return NULL;
        }
        PyErr_SetObject(PyExc_IndexError, indexerr);
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}

We can see that on non-error cases this is just going to run:

PyList_Check(op)
Py_SIZE(op)
((PyListObject *)op) -> ob_item[i]

Where PyList_Check is

#define PyList_Check(op) \
    PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_LIST_SUBCLASS)

(TABS! TABS!!!) (issue21587) That got fixed and merged in 5 minutes. Like... yeah. Damn. They put Skeet to shame.

#define Py_SIZE(ob)             (((PyVarObject*)(ob))->ob_size) 
#define PyType_FastSubclass(t,f)  PyType_HasFeature(t,f) 
#ifdef Py_LIMITED_API
#define PyType_HasFeature(t,f)  ((PyType_GetFlags(t) & (f)) != 0)
#else
#define PyType_HasFeature(t,f)  (((t)->tp_flags & (f)) != 0)
#endif

So this is normally really trivial (two indirections and a couple of boolean checks) unless Py_LIMITED_API is on, in which case... ???

Then there's the indexing and a cast (((PyListObject *)op) -> ob_item[i]) and we're done.

So there are definitely fewer checks for lists, and the small speed differences suggest this could well be relevant.


I think in general, there's just more type-checking and indirection (->) for Unicode. It seems I'm missing a point, but what?

answered by Veedrac


When you iterate over most container objects (lists, tuples, dicts, ...), the iterator delivers the objects in the container.

But when you iterate over a string, a new object has to be created for each character delivered - a string is not "a container" in the same sense a list is a container. The individual characters in a string don't exist as distinct objects before iteration creates those objects.
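
A hedged illustration of the distinction (with the caveat from the answer above that CPython caches one-character latin-1 strings, so the effect is clearest for wider characters):

>>> lst = ["\u1234"]
>>> next(iter(lst)) is lst[0]   # list iteration delivers the stored object itself
True
>>> a, b = list("\u1234\u1234")
>>> a is b                      # string iteration created each character afresh
False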

answered by Tim Peters