If I want the number of items in an iterable without caring about the elements themselves, what would be the pythonic way to get that? Right now, I would define
def ilen(it): return sum(itertools.imap(lambda _: 1, it)) # or just map in Python 3
but I understand lambda is close to being considered harmful, and lambda _: 1 certainly isn't pretty.
(The use case for this is counting the number of lines in a text file that match a regex, i.e. grep -c.)
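To make the grep -c use case concrete, here's a minimal sketch (count_matching is a name invented here for illustration, not from the question; an in-memory StringIO stands in for a real file, which works the same way):

```python
import re
from io import StringIO

def count_matching(lines, pattern):
    # Count lines matching a regex without building a list (grep -c style).
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line))

# Example with an in-memory "file"; an open file object works identically.
text = StringIO("error: disk full\nok\nerror: timeout\n")
print(count_matching(text, r"^error"))  # prints 2
```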
A terminology aside, since it matters for this question: generators, like all iterators, can be exhausted. Unless a generator is infinite, you can iterate through it only once; after the last value is produced, iteration stops and a for loop over it exits.
An iterator is any object that produces successive values via next(). A generator is a particular kind of iterator created by a function that uses a yield statement (or by a generator expression). Iterators in general can be implemented with classes; generators are implemented with functions. Every generator is an iterator, but not every iterator is a generator.
Lists, tuples, sets, and dictionaries are not iterators at all; they are iterables: containers whose elements all exist after initialization and which can be traversed many times by requesting a fresh iterator each time. This matters here because counting an iterator consumes it: after counting, the iterator is empty, while a list can still be traversed again.
The yield keyword resembles return, but with a key difference: calling a function that contains yield does not run its body and hand back a value; it returns a generator object that produces the yielded values lazily, one per iteration step.
Calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by an equivalent generator expression:

sum(1 for dummy in it)
This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.
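A quick sanity check of the generator-expression approach (a sketch; ilen here is just the one-liner above wrapped in a function, and the lazy islice/count input shows that no intermediate list is ever built):

```python
from itertools import islice, count

def ilen(it):
    # Count items by summing 1 per element; works on any iterable.
    return sum(1 for _ in it)

print(ilen([10, 20, 30]))           # prints 3
print(ilen(islice(count(), 1000)))  # prints 1000, without materializing a list
```

Note that passing an iterator (rather than a list) consumes it: after the call, the iterator is exhausted.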
Here's a method that's meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when it is short), while keeping fixed memory overhead (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs:
# On Python 2 only, get a zip that lazily generates results instead of returning a list
from future_builtins import zip

from collections import deque
from itertools import count

# Avoid constructing a deque each time; reduces fixed overhead enough
# that this beats the sum solution for all but length 0-1 inputs
consumeall = deque(maxlen=0).extend

def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    consumeall(zip(it, cnt))  # cnt must be second zip arg to avoid advancing too far
    # Since count is 0-based, the next value is the count
    return next(cnt)
Like len(list(it)), it performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding bytecode execution per loop is usually the key to performance in CPython.
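For reference, a Python 3-only version of the snippet above, trimmed of the Python 2 future_builtins import, with a couple of example calls (a sketch of the same technique):

```python
from collections import deque
from itertools import count

# deque with maxlen=0 discards everything it is fed, at C speed.
consumeall = deque(maxlen=0).extend

def ilen(it):
    cnt = count()
    # cnt is the second argument so zip stops before advancing it past the end.
    consumeall(zip(it, cnt))
    return next(cnt)  # count is 0-based, so the next value is the length

print(ilen(iter([])))             # prints 0
print(ilen(c for c in "abcdef"))  # prints 6
```

The ordering of zip's arguments is load-bearing: zip pulls from its first argument first, so when the input runs dry, cnt has been advanced exactly once per consumed element.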
It's surprisingly difficult to come up with fair test cases for comparing performance (list cheats using __length_hint__, which isn't likely to be available for arbitrary input iterables; itertools functions that don't provide __length_hint__ often have special operating modes that work faster when the value returned on each loop is released/freed before the next value is requested, which deque with maxlen=0 will do). The test case I used was to create a generator function that would take an input and return a C-level generator lacking special itertools return-container optimizations or __length_hint__, using Python 3.3+'s yield from:
def no_opt_iter(it): yield from it
Then using IPython's %%timeit magic (substituting different constants for 100):
>>> %%timeit fakeinput = (0,) * 100
... ilen(no_opt_iter(fakeinput))
When the input isn't large enough that len(list(it)) would cause memory issues, on a Linux box running Python 3.9 x64, my solution takes about 50% longer than def ilen(it): return len(list(it)), regardless of input length.
For the smallest inputs, the setup cost of loading/calling consumeall/zip/count/next means it takes marginally longer this way than def ilen(it): return sum(1 for _ in it) (about 40 ns more on my machine for a length-0 input, a 10% increase over the simple sum approach). By length-2 inputs the costs are equivalent, and somewhere around length 30 the initial overhead becomes unnoticeable compared to the real work; from there the sum approach takes roughly 50% longer.
Basically: if memory use matters or inputs don't have bounded size, and you care about speed more than brevity, use this solution. If inputs are bounded and smallish, len(list(it)) is probably best; if they're unbounded but simplicity/brevity counts, use sum(1 for _ in it).
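If you want to reproduce the comparison outside IPython, a self-contained benchmark along the lines described above might look like this (a sketch: timings will vary by machine, and the 10,000-element input and repetition count are arbitrary choices made here):

```python
import timeit
from collections import deque
from itertools import count

def ilen_sum(it):
    return sum(1 for _ in it)

def ilen_list(it):
    return len(list(it))

consumeall = deque(maxlen=0).extend

def ilen_zip(it):
    cnt = count()
    consumeall(zip(it, cnt))  # cnt second so it isn't advanced past the end
    return next(cnt)

def no_opt_iter(it):
    # Strip __length_hint__ so len(list(...)) can't preallocate
    yield from it

data = (0,) * 10_000
for fn in (ilen_sum, ilen_list, ilen_zip):
    t = timeit.timeit(lambda: fn(no_opt_iter(data)), number=200)
    print(f"{fn.__name__}: {t:.4f}s")
```

All three functions must agree on the count; only their speed and memory behavior differ.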