How do I yield an object from a generator and forget it immediately, so that it doesn't take up memory?
For example, in the following function:
    import itertools

    def grouper(iterable, chunksize):
        """
        Return elements from the iterable in `chunksize`-ed lists. The last returned
        element may be smaller (if length of collection is not divisible by `chunksize`).
        >>> print list(grouper(xrange(10), 3))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        """
        i = iter(iterable)
        while True:
            chunk = list(itertools.islice(i, int(chunksize)))
            if not chunk:
                break
            yield chunk
I don't want the function to hold on to the reference to chunk after yielding it, as it is not used further and just consumes memory, even if all outside references are gone.
EDIT: using standard Python 2.5/2.6/2.7 from python.org.
Solution (proposed almost simultaneously by @phihag and @Owen): wrap the result in a (small) mutable object and return the chunk anonymously, leaving only the small container behind:
    def chunker(iterable, chunksize):
        """
        Return elements from the iterable in `chunksize`-ed lists. The last returned
        chunk may be smaller (if length of collection is not divisible by `chunksize`).
        >>> print list(chunker(xrange(10), 3))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        """
        i = iter(iterable)
        while True:
            # Keep the chunk only inside a one-element list, never in a local name.
            wrapped_chunk = [list(itertools.islice(i, int(chunksize)))]
            if not wrapped_chunk[0]:
                break
            # pop() empties the wrapper, so the generator keeps no reference.
            yield wrapped_chunk.pop()
With this memory optimization, you can now do something like:
    for big_chunk in chunker(some_generator, chunksize=10000):
        # ... process big_chunk
        del big_chunk  # big_chunk ready to be garbage-collected :-)
        # ... do more stuff
When you call a function that contains a yield statement, its body does not run right away; the call returns a generator object to the caller. The body runs only when the generator is iterated, and each time a yield is reached, execution is suspended and the yielded value is handed to the caller.
yield is used much like the return statement, except that the function returns a generator that can be iterated over instead of a single result. You extract items by iterating over the generator, either with a for loop or by calling next().
Use yield when you want to iterate over a sequence but don't want to store the entire sequence in memory. yield is what makes a function a generator: a generator function is defined like a normal function, but whenever it needs to produce a value, it does so with the yield keyword rather than return.
The yield statement suspends the function and passes the value back to the caller; the next request for a value resumes execution from where it left off, so a yield can run many times. The return statement, by contrast, ends the execution of the function.
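To make the suspend-and-resume behaviour concrete, here is a minimal sketch. The names `countdown` and `trace` are made up for this illustration and do not come from the question.

```python
trace = []  # records when the generator body actually runs


def countdown(n):
    """Hypothetical generator, used only to illustrate yield."""
    trace.append("body started")
    while n > 0:
        yield n  # execution is suspended here until the next request
        trace.append("resumed after yielding %d" % n)
        n -= 1


gen = countdown(2)                 # creates the generator; the body has not run yet
trace_before_iteration = list(trace)

first = next(gen)                  # runs the body up to the first yield
second = next(gen)                 # resumes after the first yield, runs to the second
```

Calling `countdown(2)` alone leaves `trace` empty; entries appear only as `next()` drives the generator forward.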
After yield chunk, the variable's value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: CPython 2.7 does not seem to do this; PyPy 1.6 with the default GC does). Therefore, you don't have to change anything but your code example, which is missing the second argument to grouper.
Note that garbage collection is non-deterministic in Python. The null garbage collector, which doesn't collect free objects at all, is a perfectly valid garbage collector. From the Python manual:
Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.
Therefore, it cannot be decided whether a Python program does or "doesn't take up memory" without specifying the Python implementation and garbage collector. Given a specific Python implementation and garbage collector, you can use the gc module to test whether the object is freed.
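One possible way to run such a test is sketched below, combining gc with weakref. Everything here is an assumption made for illustration: the `Chunk` subclass (plain lists cannot be weakly referenced), the two generator variants, and the helper `chunk_survives_del`. The second variant uses the pop() trick from the question's solution. The outcome depends on the implementation; the comments describe what I would expect from CPython's reference counting.

```python
import gc
import itertools
import weakref


class Chunk(list):
    """Hypothetical list subclass; plain lists cannot be weakly referenced."""


def grouper_keeping_ref(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = Chunk(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk  # the local name `chunk` still refers to the object


def grouper_dropping_ref(iterable, chunksize):
    i = iter(iterable)
    while True:
        wrapped = [Chunk(itertools.islice(i, int(chunksize)))]
        if not wrapped[0]:
            break
        yield wrapped.pop()  # the wrapper is empty once the chunk is handed out


def chunk_survives_del(make_generator):
    """Return True if the first chunk is still alive after the caller drops it."""
    gen = make_generator(range(10), 3)
    chunk = next(gen)          # the generator is now suspended at its yield
    ref = weakref.ref(chunk)
    del chunk                  # drop the only outside reference
    gc.collect()               # ask the collector to run, in case it matters
    return ref() is not None


# Expected on CPython: the suspended frame keeps the first variant's chunk alive.
kept = chunk_survives_del(grouper_keeping_ref)
dropped = chunk_survives_del(grouper_dropping_ref)
```

Under these assumptions, `kept` should be true on any implementation, since the object is still reachable from the suspended generator, while `dropped` being false is the implementation-dependent part.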
That being said, if you really want no reference from the function (not necessarily meaning the object will be garbage-collected), here's how to do it:
    import itertools

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            # Hold the chunk only inside a temporary one-element list.
            tmpr = [list(itertools.islice(i, int(chunksize)))]
            if not tmpr[0]:
                break
            yield tmpr.pop()
Instead of a list, you can also use any other data structure with a method that removes and returns an object, like Owen's wrapper.
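As a sketch of that point, the same idea works with a `collections.deque` as the holder, since it also has a pop() method that removes and returns the item. The name `grouper_with_deque` is made up for this example.

```python
import collections
import itertools


def grouper_with_deque(iterable, chunksize):
    """Hypothetical variant: a deque holds the chunk instead of a list."""
    i = iter(iterable)
    holder = collections.deque()
    while True:
        holder.append(list(itertools.islice(i, int(chunksize))))
        if not holder[0]:
            break
        # pop() removes the chunk from the deque and returns it, so the
        # generator keeps no reference of its own after yielding.
        yield holder.pop()


chunks = list(grouper_with_deque(range(10), 3))
```

Any container works here as long as removing the item and handing it to the caller happen in one expression.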
If you really really want to get this functionality I suppose you could use a wrapper:
    class Wrap(object):

        def __init__(self, val):
            self.val = val

        def unlink(self):
            # Hand the value back and clear our own reference to it.
            val = self.val
            self.val = None
            return val
It could be used like this:
    import itertools

    def grouper(iterable, chunksize):
        i = iter(iterable)
        while True:
            chunk = Wrap(list(itertools.islice(i, int(chunksize))))
            if not chunk.val:
                break
            yield chunk.unlink()
Which is essentially the same as what phihag does with pop() ;)