If I want the number of items in an iterable without caring about the elements themselves, what would be the pythonic way to get that? Right now, I would define
def ilen(it): return sum(itertools.imap(lambda _: 1, it)) # or just map in Python 3
but I understand lambda is close to being considered harmful, and lambda _: 1 certainly isn't pretty.
(The use case for this is counting the number of lines in a text file that match a regex, i.e. grep -c.)
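To make the grep -c use case concrete, here's a minimal sketch (count_matching is a name invented here for illustration, not from the question; an in-memory StringIO stands in for a real file, which works the same way):

```python
import re
from io import StringIO

def count_matching(lines, pattern):
    # Count lines matching a regex without building a list (grep -c style).
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line))

# Example with an in-memory "file"; an open file object works identically.
text = StringIO("error: disk full\nok\nerror: timeout\n")
print(count_matching(text, r"^error"))  # prints 2
```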
A terminology aside, since it matters for this question: generators, like all iterators, can be exhausted. Unless a generator is infinite, you can iterate through it only once; after the last value is produced, iteration stops and a for loop over it exits.
An iterator is any object that produces successive values via next(). A generator is a particular kind of iterator created by a function that uses a yield statement (or by a generator expression). Iterators in general can be implemented with classes; generators are implemented with functions. Every generator is an iterator, but not every iterator is a generator.
Lists, tuples, sets, and dictionaries are not iterators at all; they are iterables: containers whose elements all exist after initialization and which can be traversed many times by requesting a fresh iterator each time. This matters here because counting an iterator consumes it: after counting, the iterator is empty, while a list can still be traversed again.
The yield keyword resembles return, but with a key difference: calling a function that contains yield does not run its body and hand back a value; it returns a generator object that produces the yielded values lazily, one per iteration step.
Calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by an equivalent generator expression:

sum(1 for dummy in it)
This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.
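A quick sanity check of the generator-expression approach (a sketch; ilen here is just the one-liner above wrapped in a function, and the lazy islice/count input shows that no intermediate list is ever built):

```python
from itertools import islice, count

def ilen(it):
    # Count items by summing 1 per element; works on any iterable.
    return sum(1 for _ in it)

print(ilen([10, 20, 30]))           # prints 3
print(ilen(islice(count(), 1000)))  # prints 1000, without materializing a list
```

Note that passing an iterator (rather than a list) consumes it: after the call, the iterator is exhausted.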
Here's a method that's meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when it is short), while keeping fixed memory overhead (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs:
# On Python 2 only, get a zip that lazily generates results instead of returning a list
from future_builtins import zip

from collections import deque
from itertools import count

# Avoid constructing a deque each time; reduces fixed overhead enough
# that this beats the sum solution for all but length 0-1 inputs
consumeall = deque(maxlen=0).extend

def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    consumeall(zip(it, cnt))  # cnt must be second zip arg to avoid advancing too far
    # Since count is 0-based, the next value is the count
    return next(cnt)
Like len(list(it)), it performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding bytecode execution per loop is usually the key to performance in CPython.
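For reference, a Python 3-only version of the snippet above, trimmed of the Python 2 future_builtins import, with a couple of example calls (a sketch of the same technique):

```python
from collections import deque
from itertools import count

# deque with maxlen=0 discards everything it is fed, at C speed.
consumeall = deque(maxlen=0).extend

def ilen(it):
    cnt = count()
    # cnt is the second argument so zip stops before advancing it past the end.
    consumeall(zip(it, cnt))
    return next(cnt)  # count is 0-based, so the next value is the length

print(ilen(iter([])))             # prints 0
print(ilen(c for c in "abcdef"))  # prints 6
```

The ordering of zip's arguments is load-bearing: zip pulls from its first argument first, so when the input runs dry, cnt has been advanced exactly once per consumed element.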
It's surprisingly difficult to come up with fair test cases for comparing performance (list cheats using __length_hint__, which isn't likely to be available for arbitrary input iterables; itertools functions that don't provide __length_hint__ often have special operating modes that work faster when the value returned on each loop is released/freed before the next value is requested, which deque with maxlen=0 will do). The test case I used was to create a generator function that would take an input and return a C-level generator lacking special itertools return-container optimizations or __length_hint__, using Python 3.3+'s yield from:
def no_opt_iter(it): yield from it
Then using IPython's %%timeit magic (substituting different constants for 100):
>>> %%timeit fakeinput = (0,) * 100
... ilen(no_opt_iter(fakeinput))
When the input isn't large enough that len(list(it)) would cause memory issues, on a Linux box running Python 3.9 x64, my solution takes about 50% longer than def ilen(it): return len(list(it)), regardless of input length.
For the smallest inputs, the setup cost of loading/calling consumeall/zip/count/next means it takes marginally longer this way than def ilen(it): return sum(1 for _ in it) (about 40 ns more on my machine for a length-0 input, a 10% increase over the simple sum approach). By length-2 inputs the costs are equivalent, and somewhere around length 30 the initial overhead becomes unnoticeable compared to the real work; from there the sum approach takes roughly 50% longer.
Basically: if memory use matters or inputs don't have bounded size, and you care about speed more than brevity, use this solution. If inputs are bounded and smallish, len(list(it)) is probably best; if they're unbounded but simplicity/brevity counts, use sum(1 for _ in it).
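If you want to reproduce the comparison outside IPython, a self-contained benchmark along the lines described above might look like this (a sketch: timings will vary by machine, and the 10,000-element input and repetition count are arbitrary choices made here):

```python
import timeit
from collections import deque
from itertools import count

def ilen_sum(it):
    return sum(1 for _ in it)

def ilen_list(it):
    return len(list(it))

consumeall = deque(maxlen=0).extend

def ilen_zip(it):
    cnt = count()
    consumeall(zip(it, cnt))  # cnt second so it isn't advanced past the end
    return next(cnt)

def no_opt_iter(it):
    # Strip __length_hint__ so len(list(...)) can't preallocate
    yield from it

data = (0,) * 10_000
for fn in (ilen_sum, ilen_list, ilen_zip):
    t = timeit.timeit(lambda: fn(no_opt_iter(data)), number=200)
    print(f"{fn.__name__}: {t:.4f}s")
```

All three functions must agree on the count; only their speed and memory behavior differ.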