Python Process blocked by urllib2

I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.

import urllib2, multiprocessing
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process

def download(url):
    """Download a page from an url.
    url [str]: url to get.
    return [unicode]: page downloaded.
    """
    if settings.DEBUG:  # settings is assumed to come from the surrounding project (e.g. Django settings)
        print u'Downloading %s' % url
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    encoding = response.headers['content-type'].split('charset=')[-1]
    content = unicode(response.read(), encoding)
    return content

def downloader(url_queue, page_queue):
    def _downloader(url_queue, page_queue):
        while True:
            try:
                url = url_queue.get()
                page_queue.put_nowait({'url': url, 'page': download(url)})
            except Exception, err:
                print u'Error downloading %s' % url
                raise err
            finally:
                url_queue.task_done()

    ## Init internal workers
    internal_url_queue = Queue()
    internal_page_queue = Queue()
    for num in range(multiprocessing.cpu_count()):
        worker = Thread(target=_downloader, args=(internal_url_queue, internal_page_queue))
        worker.setDaemon(True)
        worker.start()

    # Feed the workers until a 'STOP' sentinel arrives
    for url in iter(url_queue.get, 'STOP'):
        internal_url_queue.put(url)

    # Wait until all queued URLs have been processed
    internal_url_queue.join()

# Init the queues
url_queue = ProcessQueue()
page_queue = ProcessQueue()

# Init the process
download_worker = Process(target=downloader, args=(url_queue, page_queue))
download_worker.start()

From another module I can add URLs, and when I want to I can stop the process and wait for it to finish.

import module

module.url_queue.put('http://foobar1')
module.url_queue.put('http://foobar2')
module.url_queue.put('http://foobar3')
module.url_queue.put('STOP')
module.download_worker.join()

The problem is that when I use urlopen ("response = urllib2.urlopen(request)") everything remains blocked.

There is no problem if I call the download() function directly, or if I use only threads without a Process.

Davide Muzzarelli asked Jan 26 '10

People also ask

What does urllib2 do in Python?

The urllib package is the URL handling module for Python. It is used to fetch URLs (Uniform Resource Locators). It provides the urlopen function and is able to fetch URLs using a variety of different protocols.
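
For illustration, a minimal Python 2 sketch of that basic usage (the URL is just a placeholder):

import urllib2

# urlopen() returns a file-like object with read(), getcode() and headers.
response = urllib2.urlopen('http://example.com/')

print response.getcode()                  # HTTP status code, e.g. 200
print response.headers['content-type']    # e.g. 'text/html; charset=UTF-8'
body = response.read()                    # raw bytes of the page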

Can I use urllib2 in Python 3?

NOTE: urllib2 is no longer available in Python 3; its functionality was merged into the urllib package (urllib.request and urllib.error).

Is urllib2 deprecated?

urllib2 was removed in Python 3.x; use urllib instead.
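
For reference, a rough Python 3 equivalent of the question's download() function, using urllib.request (a sketch, not part of the original post):

import urllib.request

def download(url):
    """Download a page and decode it using the charset announced in the response headers."""
    response = urllib.request.urlopen(url)
    encoding = response.headers.get_content_charset() or 'utf-8'
    return response.read().decode(encoding)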

What does urllib2 urlopen do?

urllib2 offers a very simple interface in the form of the urlopen function: just pass the URL to urlopen() to get a "file-like" handle to the remote data. It also offers an interface for handling common situations like basic authentication, cookies, proxies and so on; these are provided by objects called handlers and openers.
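
As a sketch of what handlers and openers look like in practice (the proxy address, site URL and credentials below are placeholders):

import urllib2

# A proxy handler and a basic-auth handler, chained into one opener.
proxy_handler = urllib2.ProxyHandler({'http': 'http://proxy.example.com:8080'})

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://example.com/', 'user', 'secret')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)

opener = urllib2.build_opener(proxy_handler, auth_handler)
urllib2.install_opener(opener)      # from now on, plain urllib2.urlopen() uses this opener

response = urllib2.urlopen('http://example.com/')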


1 Answer

The issue here is not urllib2, but the use of the multiprocessing module. When using the multiprocessing module under Windows, you must not use code that runs immediately when importing your module - instead, put things in the main module inside an if __name__ == '__main__' block. See the section "Safe importing of main module" in the multiprocessing documentation.

For your code, make the following change in the downloader module:

#....
def start():
    global download_worker
    download_worker = Process(target=downloader, args=(url_queue, page_queue))
    download_worker.start()

And in the main module:

import module
if __name__=='__main__':
    module.start()
    module.url_queue.put('http://foobar1')
    #....

Because you didn't do this, each time the subprocess started it imported your module, ran the top-level code again and started yet another process, causing the hang.
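
Putting it together, a sketch of the whole corrected layout (download() and downloader() stay exactly as in the question; the only change is that nothing starts a Process at import time):

# module.py
import urllib2, multiprocessing
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process

# ... download() and downloader() unchanged from the question ...

url_queue = ProcessQueue()
page_queue = ProcessQueue()

download_worker = None

def start():
    # Create and start the worker process on demand, not at import time.
    global download_worker
    download_worker = Process(target=downloader, args=(url_queue, page_queue))
    download_worker.start()

# main.py
import module

if __name__ == '__main__':
    module.start()
    module.url_queue.put('http://foobar1')
    module.url_queue.put('STOP')
    module.download_worker.join()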

interjay answered Sep 23 '22