Many iterator "functions" in the __builtin__ module are actually implemented as types, even though the documentation describes them as "functions". Take enumerate, for instance. The documentation says that it is equivalent to:
def enumerate(sequence, start=0):
    n = start
    for elem in sequence:
        yield n, elem
        n += 1
Which is exactly as I would have implemented it, of course. However, I ran the following test with the previous definition, and got this:
>>> x = enumerate(range(10))
>>> x
<generator object enumerate at 0x01ED9F08>
Which is what I expect. However, when using the __builtin__ version, I get this:
>>> x = enumerate(range(10))
>>> x
<enumerate object at 0x01EE9EE0>
From this I infer that it is defined as
class enumerate:
    def __init__(self, sequence, start=0):
        # ....
    def __iter__(self):
        # ...
Rather than in the standard form the documentation shows. I can understand how this works and how it is equivalent to the standard form. What I want to know is the reason for doing it this way. Is it more efficient? Does it have something to do with these functions being implemented in C (I don't know if they are, but I suspect so)?
I'm using Python 2.7.2, just in case the difference is important.
Thanks in advance.
Yes, it has to do with the fact that built-ins are generally implemented in C. C code very often introduces new types instead of plain functions, as in the case of enumerate.
Writing them in C provides finer control over them and often some performance improvement, and since there is no real downside it's a natural choice.
Take into account that to write the equivalent of:
def enumerate(sequence, start=0):
    n = start
    for elem in sequence:
        yield n, elem
        n += 1
in C, i.e. a new instance of a generator, you should create a code object that contains the actual bytecode. This is not impossible, but it is no easier than writing a new type that simply implements __iter__ and __next__ (next in Python 2) through the Python C-API, and the new type brings other advantages as well.
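As a rough illustration of the "new type" approach, here is a minimal pure-Python sketch of an iterator class that behaves like enumerate. The class name EnumerateSketch is hypothetical and exists only for this example; the real built-in is written in C and does not look like this.

```python
class EnumerateSketch(object):
    """Hypothetical pure-Python stand-in for the C-level enumerate type."""

    def __init__(self, iterable, start=0):
        self._it = iter(iterable)  # underlying iterator
        self._n = start            # next index to hand out

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        # StopIteration raised by the underlying iterator propagates unchanged.
        elem = next(self._it)
        result = (self._n, elem)
        self._n += 1
        return result

    next = __next__  # Python 2 spells this method "next"


pairs = list(EnumerateSketch("abc", start=1))
print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

The state that a generator would keep in its frame (the counter and the position in the sequence) lives here in ordinary instance attributes, which is roughly what a C struct does for the built-in type.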
So, in the case of enumerate and reversed, it's simply because it provides better performance and is more maintainable.
Other advantages include:
- You can add methods to the type (e.g. chain.from_iterable). This could be done even with functions, but you'd have to first define them and then manually set the attributes, which doesn't look so clean.
- You can use isinstance on the iterables. This could allow some optimizations (e.g. if you know that isinstance(iterable, itertools.repeat), then you may be able to optimize the code since you know which values will be yielded).
Edit: Just to clarify what I mean by:
in C, i.e. a new instance of a generator, you should create a code object that contains the actual bytecode.
Looking at Objects/genobject.c, the only function to create a PyGen_Type instance is PyGen_New, whose signature is:
PyObject *
PyGen_New(PyFrameObject *f)
Now, looking at Objects/frameobject.c, we can see that to create a PyFrameObject you must call PyFrame_New, which has this signature:
PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
            PyObject *locals)
As you can see, it requires a PyCodeObject instance. PyCodeObjects are how the Python interpreter represents bytecode internally (e.g. a PyCodeObject can represent the bytecode of a function), so: yes, to create a PyGen_Type instance from C you must manually create the bytecode, and it's not so easy to create PyCodeObjects, since PyCode_New has this signature (as found in the Python 3 sources; the Python 2.7 version has no kwonlyargcount parameter):
PyCodeObject *
PyCode_New(int argcount, int kwonlyargcount,
           int nlocals, int stacksize, int flags,
           PyObject *code, PyObject *consts, PyObject *names,
           PyObject *varnames, PyObject *freevars, PyObject *cellvars,
           PyObject *filename, PyObject *name, int firstlineno,
           PyObject *lnotab)
Note how it contains arguments such as firstlineno and filename, which are obviously meant to be obtained from Python source and not from other C code. You can of course create it in C, but I'm not at all sure that it would require fewer characters than writing a simple new type.
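To make the link between generators, frames and code objects a bit more concrete, the following sketch inspects a generator from Python. The function count_up is a made-up example; gi_code, gi_frame, co_name, co_filename and co_firstlineno are the standard introspection attributes.

```python
def count_up(limit):
    """Made-up generator function used only for inspection."""
    n = 0
    while n < limit:
        yield n
        n += 1


gen = count_up(3)
code = gen.gi_code  # the code object holding the generator's bytecode

print(code.co_name)                 # count_up
print(code.co_filename)             # file the function was compiled from
print(code.co_firstlineno)          # line where the "def" starts
print(gen.gi_frame is not None)     # True: a live generator owns a frame

# The built-in enumerate object has neither a code object nor a frame.
print(hasattr(enumerate([]), "gi_code"))   # False
print(hasattr(enumerate([]), "gi_frame"))  # False
```

The filename and line number are exactly the kind of source-level metadata that C code would have to invent in order to build a code object by hand.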
Yes, they're implemented in C. They use the C API for iterators (PEP 234), in which iterators are defined by creating new types that have the tp_iternext slot.
Functions written with the generator syntax (yield) are 'magical' functions that return a special generator object. These objects are instances of types.GeneratorType, which you cannot create manually. If a different library that uses the C API defines its own iterator type, it won't be an instance of GeneratorType, but it will still implement the C API iterator protocol.
Therefore, the enumerate type is a distinct type that is different from GeneratorType, and you can use it like any other type, with isinstance and such (although you shouldn't).
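A short check of the distinction described above, as a sketch. The helper gen_enumerate is just the generator version from the documentation under a different (hypothetical) name, so that it does not shadow the built-in.

```python
import types


def gen_enumerate(sequence, start=0):
    """Generator version from the docs, renamed to avoid shadowing the built-in."""
    n = start
    for elem in sequence:
        yield n, elem
        n += 1


g = gen_enumerate("ab")
e = enumerate("ab")

print(isinstance(g, types.GeneratorType))  # True
print(isinstance(e, types.GeneratorType))  # False
print(isinstance(e, enumerate))            # True: enumerate is itself a type
print(isinstance(enumerate, type))         # True
print(list(g) == list(e))                  # True: same items either way
```

Both objects support iteration in the same way; only their types differ.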
As opposed to the generator route discussed in Bakuriu's answer, enumerate isn't a generator, so there's no bytecode and no frames involved.
$ grep -i 'frame\|gen' Objects/enumobject.c
PyObject_GenericGetAttr, /* tp_getattro */
PyType_GenericAlloc, /* tp_alloc */
PyObject_GenericGetAttr, /* tp_getattro */
PyType_GenericAlloc, /* tp_alloc */
Instead, a new enumobject is created by the function enum_new, whose signature doesn't use a frame:
static PyObject *
enum_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
This function is placed within the tp_new slot of the PyEnum_Type struct (of type PyTypeObject). Here, we also see that the tp_iternext slot is occupied by the enum_next function, which contains straightforward C code that gets the next item of the iterator it's enumerating over, and then returns a PyObject (a tuple).
Moving on, PyEnum_Type is then placed into the builtin module (Python/bltinmodule.c) with the name enumerate, so that it is publicly accessible.
No bytecode needed. Pure C. Much more efficient than any pure Python or GeneratorType implementation.
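If you want to check the efficiency claim on your own machine, a rough timing sketch could look like the following. The helper names and the input size are arbitrary choices for this example, and the numbers will vary with the interpreter and hardware, so no expected timings are shown.

```python
import timeit


def gen_enumerate(sequence, start=0):
    """Generator version from the docs, renamed to avoid shadowing the built-in."""
    n = start
    for elem in sequence:
        yield n, elem
        n += 1


data = list(range(1000))  # arbitrary test input


def consume_builtin():
    for _ in enumerate(data):
        pass


def consume_generator():
    for _ in gen_enumerate(data):
        pass


builtin_time = timeit.timeit(consume_builtin, number=200)
generator_time = timeit.timeit(consume_generator, number=200)

print("built-in enumerate: %.4f s" % builtin_time)
print("generator version:  %.4f s" % generator_time)
```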