Why is it slower to iterate over a small string than a small list?

I was playing around with timeit and noticed that a simple list comprehension over a small string took longer than the same operation on a list of small single-character strings. Any explanation? It takes almost 1.35 times as long.

>>> from timeit import timeit
>>> timeit("[x for x in 'abc']")
2.0691067844831528
>>> timeit("[x for x in ['a', 'b', 'c']]")
1.5286479570345861

What's happening on a lower level that's causing this?

asked May 26 '14 by Sunjay Varma




2 Answers

TL;DR

  • For Python 2, the actual speed difference is closer to 70% (or more) once much of the overhead is removed.

  • Object creation is not at fault: neither method creates a new object, because one-character strings are cached (see the quick check just after this list).

  • The difference is not obvious, but most likely comes from a greater number of checks during string indexing, covering both type and well-formedness, plus the need to decide what to return.

  • List indexing is remarkably fast.
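
A quick check of the caching claim (this relies on a CPython implementation detail: one-character latin-1 strings are kept in a cache, so indexing hands back the same object every time):

>>> s = "abc"
>>> s[0] is s[0]   # the same cached object both times
True
>>> s[0] is "a"    # the literal resolves to that cached object too
True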



>>> python3 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.388 usec per loop

>>> python3 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.436 usec per loop

This disagrees with what you've found...

You must be using Python 2, then.

>>> python2 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.212 usec per loop

Let's explain the difference between the versions. I'll examine the compiled code.

For Python 3:

import dis

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   4           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b118a0, file "", line 4>)
#>>>               3 LOAD_CONST               2 ('list_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('a')
#>>>              12 LOAD_CONST               4 ('b')
#>>>              15 LOAD_CONST               5 ('c')
#>>>              18 BUILD_LIST               3
#>>>              21 GET_ITER
#>>>              22 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              25 POP_TOP
#>>>              26 LOAD_CONST               0 (None)
#>>>              29 RETURN_VALUE

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>  21           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b17150, file "", line 21>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('abc')
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

You see here that the list variant is likely to be slower due to the building of the list each time.

This is the

 9 LOAD_CONST   3 ('a')
12 LOAD_CONST   4 ('b')
15 LOAD_CONST   5 ('c')
18 BUILD_LIST   3

part. The string variant only has

 9 LOAD_CONST   3 ('abc') 

You can check that this does seem to make a difference:

def string_iterate():
    [item for item in ("a", "b", "c")]

dis.dis(string_iterate)
#>>>  35           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d068be660, file "", line 35>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               6 (('a', 'b', 'c'))
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

This produces just

 9 LOAD_CONST               6 (('a', 'b', 'c')) 

as tuples are immutable. Test:

>>> python3 -m timeit '[x for x in ("a", "b", "c")]'
1000000 loops, best of 3: 0.369 usec per loop

Great, back up to speed.
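
You can also see the constant folding directly. This is my own quick illustration, not from the benchmarks above: a tuple literal of constants is baked into the code object and reused on every evaluation, whereas a list literal is rebuilt each time.

>>> def with_tuple():
...     return ("a", "b", "c")
...
>>> with_tuple() is with_tuple()   # one folded constant, reused
True
>>> def with_list():
...     return ["a", "b", "c"]
...
>>> with_list() is with_list()     # a fresh list built on every call
False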

For Python 2:

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('a')
#>>>               6 LOAD_CONST               2 ('b')
#>>>               9 LOAD_CONST               3 ('c')
#>>>              12 BUILD_LIST               3
#>>>              15 GET_ITER
#>>>         >>   16 FOR_ITER                12 (to 31)
#>>>              19 STORE_FAST               0 (item)
#>>>              22 LOAD_FAST                0 (item)
#>>>              25 LIST_APPEND              2
#>>>              28 JUMP_ABSOLUTE           16
#>>>         >>   31 POP_TOP
#>>>              32 LOAD_CONST               0 (None)
#>>>              35 RETURN_VALUE

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('abc')
#>>>               6 GET_ITER
#>>>         >>    7 FOR_ITER                12 (to 22)
#>>>              10 STORE_FAST               0 (item)
#>>>              13 LOAD_FAST                0 (item)
#>>>              16 LIST_APPEND              2
#>>>              19 JUMP_ABSOLUTE            7
#>>>         >>   22 POP_TOP
#>>>              23 LOAD_CONST               0 (None)
#>>>              26 RETURN_VALUE

The odd thing is that we have the same building of the list here, yet it's still faster. Python 2 is acting strangely fast.

Let's remove the comprehensions and re-time. The _ = is to prevent it getting optimised out.

>>> python3 -m timeit '_ = ["a", "b", "c"]'
10000000 loops, best of 3: 0.0707 usec per loop

>>> python3 -m timeit '_ = "abc"'
100000000 loops, best of 3: 0.0171 usec per loop

We can see that initialization is not significant enough to account for the difference between the versions (those numbers are small)! We can thus conclude that Python 3 has slower comprehensions. This makes sense as Python 3 changed comprehensions to have safer scoping.
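
The scoping change is easy to demonstrate; here's a minimal sketch. Python 3 compiles each comprehension into its own function object (that's the MAKE_FUNCTION/CALL_FUNCTION pair in the bytecode above), so the loop variable no longer leaks:

>>> x = "outer"
>>> [x for x in "abc"]
['a', 'b', 'c']
>>> x   # Python 2 would print 'c' here, because the loop variable leaked
'outer'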

Well, now let's improve the benchmark (I'm just removing overhead that isn't iteration). Pre-assigning the iterable removes the cost of building it:

>>> python3 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.387 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
1000000 loops, best of 3: 0.368 usec per loop

>>> python2 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
10000000 loops, best of 3: 0.164 usec per loop

We can check whether calling iter is the overhead:

>>> python3 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.099 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.1 usec per loop

>>> python2 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.0913 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.0854 usec per loop

No. No it is not. The difference is too small, especially for Python 3.

So let's remove yet more unwanted overhead... by making the whole thing slower! The aim is just to have a longer iteration so the time hides overhead.

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 3.12 msec per loop

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.77 msec per loop

>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 2.32 msec per loop

>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.09 msec per loop

This hasn't actually changed much, but it's helped a little.

So remove the comprehension. It's overhead that's not part of the question:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.71 msec per loop

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 1.36 msec per loop

>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.27 msec per loop

>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 935 usec per loop

That's more like it! We can get slightly faster still by using deque to iterate. It's basically the same, but it's faster:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop

>>> python2 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 805 usec per loop

>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
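
As an aside, deque(iterable, maxlen=0) is the standard trick for exhausting an iterator at C speed: a zero-length deque discards every item it is fed. It's the heart of the consume recipe from the itertools docs, roughly (the full recipe also supports consuming just n items):

from collections import deque

def consume(iterator):
    # Feed the iterator into a zero-length deque: the loop runs in C
    # and every delivered item is immediately discarded.
    deque(iterator, maxlen=0)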

What impresses me is that Unicode is competitive with bytestrings. We can check this explicitly by trying bytes and unicode in both:

  • bytes

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)).encode("ascii") for _ in range(100000))' 'deque(iterable, maxlen=0)'    :(
    1000 loops, best of 3: 571 usec per loop

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)).encode("ascii") for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 757 usec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 438 usec per loop

    Here you see that Python 3 is actually faster than Python 2.

  • unicode

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = u"".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 800 usec per loop

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = u"".join(unichr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 1.07 msec per loop

    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = [unichr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 469 usec per loop

    Again, Python 3 is faster, although this is to be expected (str has had a lot of attention in Python 3).

In fact, this unicode-bytes difference is very small, which is impressive.

So let's analyse this one case, seeing as it's fast and convenient for me:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop

We can actually rule out Tim Peters' 10-times-upvoted answer!

>>> foo = iterable[123]
>>> iterable[36] is foo
True

These are not new objects!

But indexing costs are worth mentioning. The difference will likely lie in the indexing, so let's remove the iteration and just index:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'iterable[123]'
10000000 loops, best of 3: 0.0397 usec per loop

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable[123]'
10000000 loops, best of 3: 0.0374 usec per loop

The difference seems small, but at least half of the cost is overhead:

>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable; 123'
100000000 loops, best of 3: 0.0173 usec per loop

so the speed difference is sufficient to decide to blame it. I think.
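
To make that concrete, here's some back-of-the-envelope arithmetic on the numbers above (a rough sanity check, not a rigorous decomposition):

>>> str_index, list_index, harness = 0.0397, 0.0374, 0.0173  # usec per op
>>> round(str_index - harness, 4)   # net cost of one string index
0.0224
>>> round(list_index - harness, 4)  # net cost of one list index
0.0201
>>> round((str_index - list_index) * 100000)  # usec gap over 100,000 items
230

That ~230 usec is the same order as the gaps in the long-loop timings earlier, which is why indexing looks like a reasonable suspect.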

So why is indexing a list so much faster?

Well, I'll come back to you on that, but my guess is that it's down to the check for interned strings (or cached characters, if that's a separate mechanism). This will be less fast than optimal. But I'll go check the source (although I'm not comfortable in C...) :).


So here's the source:

static PyObject *
unicode_getitem(PyObject *self, Py_ssize_t index)
{
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 ch;
    PyObject *res;

    if (!PyUnicode_Check(self) || PyUnicode_READY(self) == -1) {
        PyErr_BadArgument();
        return NULL;
    }
    if (index < 0 || index >= PyUnicode_GET_LENGTH(self)) {
        PyErr_SetString(PyExc_IndexError, "string index out of range");
        return NULL;
    }
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    ch = PyUnicode_READ(kind, data, index);
    if (ch < 256)
        return get_latin1_char(ch);

    res = PyUnicode_New(1, ch);
    if (res == NULL)
        return NULL;
    kind = PyUnicode_KIND(res);
    data = PyUnicode_DATA(res);
    PyUnicode_WRITE(kind, data, 0, ch);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;
}

Walking from the top, we'll have some checks. These are boring. Then some assigns, which should also be boring. The first interesting line is

ch = PyUnicode_READ(kind, data, index); 

but we'd hope that is fast, as we're reading from a contiguous C array by indexing it. The result, ch, will be less than 256 so we'll return the cached character in get_latin1_char(ch).
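
You can poke at this branch from Python; this is my own sanity check on a CPython 3 build, not part of the original timings. Characters below 256 come back as the cached singletons, while anything above 255 takes the PyUnicode_New path and allocates a fresh one-character string on every index:

>>> s = "a\u1234"
>>> s[0] is s[0]   # ch < 256: the cached latin-1 singleton
True
>>> s[1] is s[1]   # ch >= 256: a new object each time
False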

So we'll run (dropping the first checks)

kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
return get_latin1_char(ch);

Where

#define PyUnicode_KIND(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)), \
     ((PyASCIIObject *)(op))->state.kind)

(which is boring because the asserts get compiled out in non-debug builds [so I can count them as fast] and ((PyASCIIObject *)(op))->state.kind is (I think) an indirection and a C-level cast);

#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) : \
     _PyUnicode_NONCOMPACT_DATA(op))

(which is also boring for similar reasons, assuming the macros (Something_CAPITALIZED) are all fast),

#define PyUnicode_READ(kind, data, index) \
    ((Py_UCS4) \
    ((kind) == PyUnicode_1BYTE_KIND ? \
        ((const Py_UCS1 *)(data))[(index)] : \
        ((kind) == PyUnicode_2BYTE_KIND ? \
            ((const Py_UCS2 *)(data))[(index)] : \
            ((const Py_UCS4 *)(data))[(index)] \
        ) \
    ))

(which involves indexes but really isn't slow at all) and

static PyObject*
get_latin1_char(unsigned char ch)
{
    PyObject *unicode = unicode_latin1[ch];
    if (!unicode) {
        unicode = PyUnicode_New(1, ch);
        if (!unicode)
            return NULL;
        PyUnicode_1BYTE_DATA(unicode)[0] = ch;
        assert(_PyUnicode_CheckConsistency(unicode, 1));
        unicode_latin1[ch] = unicode;
    }
    Py_INCREF(unicode);
    return unicode;
}

Which confirms my suspicion that:

  • This is cached:

    PyObject *unicode = unicode_latin1[ch]; 
  • This should be fast. The if (!unicode) is not run, so it's literally equivalent in this case to

    PyObject *unicode = unicode_latin1[ch];
    Py_INCREF(unicode);
    return unicode;

Honestly, after testing that the asserts are fast (by disabling them [I think that works on the C-level asserts...]), the only plausibly-slow parts are:

PyUnicode_IS_COMPACT(op)
_PyUnicode_COMPACT_DATA(op)
_PyUnicode_NONCOMPACT_DATA(op)

Which are:

#define PyUnicode_IS_COMPACT(op) \
    (((PyASCIIObject*)(op))->state.compact)

(fast, as before),

#define _PyUnicode_COMPACT_DATA(op) \
    (PyUnicode_IS_ASCII(op) ? \
     ((void*)((PyASCIIObject*)(op) + 1)) : \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

(fast if the macro IS_ASCII is fast), and

#define _PyUnicode_NONCOMPACT_DATA(op) \
    (assert(((PyUnicodeObject*)(op))->data.any), \
     ((((PyUnicodeObject *)(op))->data.any)))

(also fast as it's an assert plus an indirection plus a cast).

So we're down (the rabbit hole) to:

PyUnicode_IS_ASCII 

which is

#define PyUnicode_IS_ASCII(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)), \
     ((PyASCIIObject*)op)->state.ascii)

Hmm... that seems fast too...


Well, OK, but let's compare it to PyList_GetItem. (Yeah, thanks Tim Peters for giving me more work to do :P.)

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        if (indexerr == NULL) {
            indexerr = PyUnicode_FromString(
                "list index out of range");
            if (indexerr == NULL)
                return NULL;
        }
        PyErr_SetObject(PyExc_IndexError, indexerr);
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}

We can see that on non-error cases this is just going to run:

PyList_Check(op)
Py_SIZE(op)
((PyListObject *)op) -> ob_item[i]

Where PyList_Check is

#define PyList_Check(op) \
    PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_LIST_SUBCLASS)

(TABS! TABS!!!) (issue21587) That got fixed and merged in 5 minutes. Like... yeah. Damn. They put Skeet to shame.

#define Py_SIZE(ob)             (((PyVarObject*)(ob))->ob_size) 
#define PyType_FastSubclass(t,f)  PyType_HasFeature(t,f) 
#ifdef Py_LIMITED_API
#define PyType_HasFeature(t,f)  ((PyType_GetFlags(t) & (f)) != 0)
#else
#define PyType_HasFeature(t,f)  (((t)->tp_flags & (f)) != 0)
#endif

So this is normally really trivial (two indirections and a couple of boolean checks) unless Py_LIMITED_API is on, in which case... ???

Then there's the indexing and a cast (((PyListObject *)op) -> ob_item[i]) and we're done.

So there are definitely fewer checks for lists, and the small speed differences suggest this could well be relevant.


I think in general, there's just more type-checking and indirection (->) for Unicode. It seems I'm missing a point, but what?

answered by Veedrac


When you iterate over most container objects (lists, tuples, dicts, ...), the iterator delivers the objects in the container.

But when you iterate over a string, a new object has to be created for each character delivered - a string is not "a container" in the same sense a list is a container. The individual characters in a string don't exist as distinct objects before iteration creates those objects.
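
A hedged illustration of the distinction (with the caveat from the answer above that CPython caches one-character latin-1 strings, so the effect is clearest for wider characters):

>>> lst = ["\u1234"]
>>> next(iter(lst)) is lst[0]   # list iteration delivers the stored object itself
True
>>> a, b = list("\u1234\u1234")
>>> a is b                      # string iteration created each character afresh
False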

answered by Tim Peters