Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pause Python Generator

I have a python generator that does work that produces a large amount of data, which uses up a lot of ram. Is there a way of detecting if the processed data has been "consumed" by the code which is using the generator, and if so, pause until it is consumed?

def multi_grab(urls,proxy=None,ref=None,xpath=False,compress=True,delay=10,pool_size=50,retries=1,http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy,delay=delay)
        pool_size = len(pool_size.records)
    work_pool = pool.Pool(pool_size)
    partial_grab = partial(grab,proxy=proxy,post=None,ref=ref,xpath=xpath,compress=compress,include_url=True,retries=retries,http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab,urls):
        if result:
            yield result

run from:

if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com',xpath=True).xpath('//a/@href') if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links,pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1
like image 492
Matt Avatar asked Mar 02 '26 09:03

Matt


2 Answers

A generator simply yields values. There's no way for the generator to know what's being done with them.

But the generator also pauses constantly, as the caller does whatever it does. It doesn't execute again until the caller invokes it to get the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?

like image 142
Ned Batchelder Avatar answered Mar 05 '26 00:03

Ned Batchelder


The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra ram) is when the objects are being referenced somewhere else (such as adding them to a list). Make sure you aren't saving these variables unnecessarily.

If you're dealing with multithreading/processing, then you probably want to implement a Queue that you could pull data from, keeping track of the number of tasks you're processing.

like image 23
TorelTwiddler Avatar answered Mar 04 '26 22:03

TorelTwiddler



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!