I have these two implementations to compute the length of a finite generator, while keeping the data for further processing:
def count_generator1(generator):
    '''- build a list with the generator data
       - get the length of the data
       - return both the length and the original data (in a list)
       WARNING: the memory use is unbounded, and infinite generators will block this'''
    l = list(generator)
    return len(l), l
import itertools

def count_generator2(generator):
    '''- get two generators from the original generator
       - get the length of the data from one of them
       - return both the length and the original data, as returned by tee
       WARNING: tee can use up an unbounded amount of memory, and infinite generators will block this'''
    for_length, saved = itertools.tee(generator, 2)
    return sum(1 for _ in for_length), saved
Both have drawbacks, both do the job. Could somebody comment on them, or even offer a better alternative?
Python generators are a simple way of creating iterators. The bookkeeping an iterator class would otherwise need (the __iter__ and __next__ methods, and raising StopIteration at the end) is handled automatically. Simply put, a generator function returns an iterator object that we can iterate over one value at a time.
yield is a Python keyword that hands a value back to the caller without destroying the state of the function's local variables. When the next value is requested, execution resumes right after the last yield statement instead of starting over. Any function that contains yield is a generator function, so yield is what makes a generator.
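To make the pause-and-resume behaviour concrete, here is a minimal sketch. The function name countdown and the variable names are made up for illustration; they are not from the question.

```python
def countdown(n):
    """Hypothetical generator function: yields n, n-1, ..., 1."""
    while n > 0:
        yield n  # execution pauses here; the local variable n is preserved
        n -= 1


gen = countdown(3)   # calling the function runs none of its body yet
first = next(gen)    # runs up to the first yield
second = next(gen)   # resumes right after that yield, with n intact
rest = list(gen)     # drains whatever is left
```

Note that a generator like gen has no len(), which is why the question needs to consume it in order to count it.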
I ran Windows 64-bit Python 3.4.3 timeit on a few approaches I could think of:
>>> from timeit import timeit
>>> from textwrap import dedent as d
>>> timeit(
...     d("""
...         count = -1
...         for _ in s:
...             count += 1
...         count += 1
...     """),
...     "s = range(1000)",
... )
50.70772041983173
>>> timeit(
...     d("""
...         count = -1
...         for count, _ in enumerate(s):
...             pass
...         count += 1
...     """),
...     "s = range(1000)",
... )
42.636973504498656
>>> timeit(
...     d("""
...         count, _ = reduce(f, enumerate(range(1000)), (-1, -1))
...         count += 1
...     """),
...     d("""
...         from functools import reduce
...         def f(_, count):
...             return count
...         s = range(1000)
...     """),
... )
121.15513102540672
>>> timeit("count = sum(1 for _ in s)", "s = range(1000)")
58.179126025925825
>>> timeit("count = len(tuple(s))", "s = range(1000)")
19.777029680237774
>>> timeit("count = len(list(s))", "s = range(1000)")
18.145157531932
>>> timeit("count = len(list(1 for _ in s))", "s = range(1000)")
57.41422175998332
Shockingly, the fastest approach was to use a list (not even a tuple) to exhaust the iterator and get the length from there:
>>> timeit("count = len(list(s))", "s = range(1000)")
18.145157531932
Of course, this risks memory issues. The best low-memory alternative was to use enumerate on a NOOP for-loop:
>>> timeit(
...     d("""
...         count = -1
...         for count, _ in enumerate(s):
...             pass
...         count += 1
...     """),
...     "s = range(1000)",
... )
42.636973504498656
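For reuse, the enumerate loop can be wrapped in a small helper. This is only a sketch: count_items is a name I made up, and starting enumerate at 1 replaces the "count = -1 ... count += 1" bookkeeping from the timing above. Keep in mind that it consumes the iterator, so the data is gone afterwards.

```python
def count_items(iterable):
    """Count the items of an iterable without storing them.

    The iterable is consumed, so this only suits cases where the
    data is not needed afterwards.
    """
    count = 0  # stays 0 if the iterable is empty
    # enumerate(..., 1) makes count the running number of items seen
    for count, _ in enumerate(iterable, 1):
        pass
    return count


n_range = count_items(range(1000))
n_gen = count_items(ch for ch in "abc")
```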
Cheers!
If you have to do this, the first method is much better - as you consume all the values, itertools.tee() will have to store all the values anyway, meaning a list will be more efficient.
To quote from the docs:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
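If the length is only needed after the data has been processed, a possible alternative is to count while iterating, which stores nothing. This is a sketch under that assumption; CountingIterator is a hypothetical helper, not something from itertools, and it does not help if you need the length before processing starts.

```python
class CountingIterator:
    """Hypothetical wrapper: passes items through and counts them."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0  # number of items handed out so far

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._it)  # StopIteration propagates when exhausted
        self.count += 1
        return item


counted = CountingIterator(x * x for x in range(5))
total = sum(counted)  # stands in for the "further processing" step
n = counted.count     # the full length is known once the data is consumed
```

Memory use stays constant here, but the count is only complete once the wrapped iterator has been exhausted.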