In Python the interface of an iterable is a subset of the iterator interface. This has the advantage that in many cases they can be treated in the same way. However, there is an important semantic difference between the two, since for an iterable __iter__ returns a new iterator object and not just self. How can I test that an iterable is really an iterable and not an iterator? Conceptually I understand iterables to be collections, while an iterator only manages the iteration (i.e. keeps track of the position) but is not a collection itself.
The difference is for example important when one wants to loop multiple times. If an iterator is given then the second loop will not work since the iterator was already used up and directly raises StopIteration.
It is tempting to test for a next method, but this seems dangerous and somehow wrong. Should I just check that the second loop was empty?
Is there any way to do such a test in a more pythonic way? I know that this sounds like a classic case of LBYL against EAFP, so maybe I should just give up? Or am I missing something?
Edit: S.Lott says in his answer below that this is primarily a problem of wanting to do multiple passes over the iterator, and that one should not do this in the first place. However, in my case the data is very large and depending on the situation has to be passed over multiple times for data processing (there is absolutely no way around this).
The iterable is also provided by the user, and for situations where a single pass is enough it will work with an iterator (e.g. created by a generator for simplicity). But it would be nice to safeguard against the case where a user provides only an iterator when multiple passes are needed.
Edit 2:
Actually this is a very nice example for Abstract Base Classes. The __iter__ methods in an iterator and an iterable have the same name but are semantically different! So hasattr is useless, but isinstance provides a clean solution.
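A minimal sketch of that isinstance check, assuming Python 3, where the abstract base classes live in collections.abc. The helper name require_reiterable and its error messages are hypothetical, chosen only for illustration.

```python
from collections.abc import Iterable, Iterator


def require_reiterable(obj):
    """Return obj if it can be looped over repeatedly, else raise TypeError.

    Hypothetical helper: the name and messages are illustrative only.
    """
    # Iterators (including generators) define __next__ and are one-shot.
    if isinstance(obj, Iterator):
        raise TypeError("got a one-shot iterator; pass a re-iterable collection")
    # Anything else must at least define __iter__.
    if not isinstance(obj, Iterable):
        raise TypeError("object is not iterable")
    return obj


data = require_reiterable([1, 2, 3])        # a list passes the check

rejected = None
try:
    require_reiterable(x for x in data)     # a generator is an Iterator
except TypeError as exc:
    rejected = str(exc)
```

One caveat: isinstance(obj, Iterable) only detects classes that define __iter__ (or are registered with the ABC), so an object that is iterable purely through __getitem__ would be rejected by this sketch.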
'iterator' if obj is iter(obj) else 'iterable'
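Wrapped in a function, the one-liner above could be used as follows. This is a sketch assuming Python 3; the function name kind is made up for the example. It relies on the convention that an iterator's __iter__ returns the iterator itself, and iter() raises TypeError if obj is not iterable at all.

```python
def kind(obj):
    # iter() on an iterator hands back the very same object, while
    # iter() on a collection builds a fresh iterator each time.
    return 'iterator' if obj is iter(obj) else 'iterable'


numbers = [1, 2, 3]
kinds = (
    kind(numbers),                  # list
    kind(iter(numbers)),            # list iterator
    kind(n for n in numbers),       # generator
)
```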
However, there is an important semantic difference between the two...
Not really semantic or important. They're both iterable -- they both work with a for statement.
The difference is for example important when one wants to loop multiple times.
When does this ever come up? You'll have to be more specific. In the rare cases when you need to make two passes through an iterable collection, there are often better algorithms.
For example, let's say you're processing a list. You can iterate through a list all you want. Why did you get tangled up with an iterator instead of the iterable? Okay that didn't work.
Okay, here's one. You're reading a file in two passes, and you need to know how to reset the iterable. In this case, it's a file, and seek
is required; or a close and a reopen. That feels icky. You can readlines
to get a list which allows two passes with no complexity. So that's not necessary.
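The two file options mentioned above might look like this. It is only a sketch, assuming Python 3, with io.StringIO standing in for a real seekable file object; the sample contents are invented.

```python
import io

# io.StringIO stands in for a file returned by open(...).
stream = io.StringIO("alpha\nbeta\ngamma\n")

# First pass: iterating consumes the stream up to the end.
line_count = sum(1 for _ in stream)

# Rewind, then make the second pass over the same object.
stream.seek(0)
longest = max(len(line.rstrip("\n")) for line in stream)

# Alternative: load everything into a list, which can be looped over freely.
stream.seek(0)
lines = stream.readlines()
```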
Wait, what if we have a file so big we can't read it all into memory? And, for obscure reasons, we can't seek, either. What then?
Now, we're down to the nitty-gritty of two passes. On the first pass, we accumulated something. An index or a summary or something. An index has all the file's data. A summary, often, is a restructuring of the data. With a small change from "summary" to "restructure", we've preserved the file's data in the new structure. In both cases, we don't need the file -- we can use the index or the summary.
All "two-pass" algorithms can be changed to one pass of the original iterator or iterable and a second pass of a different data structure.
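One case where this works is sketched below, assuming Python 3. The records generator and its data are hypothetical. The single pass over the one-shot source builds a summary, and the second pass runs over that summary rather than over the source.

```python
from collections import defaultdict


def records():
    # Hypothetical one-shot source of (key, value) pairs.
    yield from [("a", 2), ("b", 3), ("a", 5)]


# Pass 1: consume the source once and accumulate a summary.
totals = defaultdict(int)
for key, value in records():
    totals[key] += value

# Pass 2: iterate over the summary, not the original source.
grand_total = sum(totals.values())
shares = {key: total / grand_total for key, total in totals.items()}
```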
This is neither LBYL nor EAFP. This is algorithm design. You don't need to reset an iterator -- YAGNI.
Edit
Here's an example of an iterator/iterable issue. It's simply a poorly-designed algorithm.
it = iter(xrange(3))
for i in it: print i,  # prints 0 1 2
for i in it: print i,  # prints nothing
This is trivially fixed.
it = range(3)
for i in it: print i
for i in it: print i
The "multiple times in parallel" is trivially fixed. Write an API that requires an iterable. And when someone refuses to read the API documentation or refuses to follow it after having read it, their stuff breaks. As it should.
This and the "nice to safeguard against the case where a user provides only an iterator when multiple passes are needed" scenario are both examples of insane people writing code that breaks our simple API.
If someone is insane enough to read most (but not all) of the API doc and provide an iterator when an iterable was required, you need to find this person and teach them how to (1) read all the API documentation and (2) follow the API documentation.
The "safeguard" issue isn't very realistic. These crazy programmers are remarkably rare. And in the few cases when it does arise, you know who they are and can help them.
Edit 2
The "we have to read the same structure multiple times" algorithms are a fundamental problem.
Do not do this.
for element in someBigIterable:
    function1( element )
for element in someBigIterable:
    function2( element )
...
Do this, instead.
for element in someBigIterable:
    function1( element )
    function2( element )
...
Or, consider something like this.
for element in someBigIterable:
    for f in ( function1, function2, function3, ... ):
        f( element )
In most cases, this kind of "pivot" of your algorithms results in a program that might be easier to optimize and might be a net improvement in performance.
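A runnable version of the single-pass pattern, assuming Python 3. The generator numbers and the bodies of function1 and function2 are placeholders invented for this sketch. Each function keeps its own state, so the one-shot source is traversed only once.

```python
def numbers():
    # Hypothetical one-shot source; a generator cannot be replayed.
    yield from range(1, 6)


squares = []
running = {"total": 0}


def function1(element):
    # Placeholder step: collect the square of each element.
    squares.append(element * element)


def function2(element):
    # Placeholder step: keep a running sum.
    running["total"] += element


# One traversal feeds every processing step.
for element in numbers():
    for f in (function1, function2):
        f(element)
```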