 

urllib2 urlopen read timeout/block

Recently I have been working on a tiny crawler for downloading the images on a URL.

I use urlopen() from urllib2 together with open()/f.write():

Here is the code snippet:

import re
import urllib2

# the list of the images' URLs (regImg and pageHtml are defined elsewhere)
imglist = re.findall(regImg, pageHtml)

# iterate to download images
for index in xrange(1,len(imglist)+1):
    img = urllib2.urlopen(imglist[index-1])
    f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)

In the code above, img.read() can potentially block for a long time. I would like to retry / re-open the image URL when that happens.

I am also concerned about the efficiency of the code above: if the number of images to download is fairly large, using a thread pool to download them seems better.

Any suggestions? Thanks in advance.

p.s. I found that it is the read() call on the img object that may block, so adding a timeout parameter to urlopen() alone seemed useless to me. But the file-like object has no timeout version of read(). Any suggestions on this? Thanks very much.

asked by destiny1020

1 Answer

urllib2.urlopen() accepts a timeout parameter which is used for all blocking operations (connection setup, socket reads, etc.), so a read() that stalls longer than the timeout raises an exception instead of blocking forever.
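For illustration, here is a minimal retry sketch (the helper name, timeout value, and retry count are my own assumptions, not part of the question): passing timeout to urlopen() makes a stalled read() raise socket.timeout, which can be caught and retried.

import socket
import urllib2

def fetch_with_retry(url, timeout=10, retries=3):
    # the timeout applies to the connect as well as to each blocking socket read
    for attempt in range(1, retries + 1):
        try:
            response = urllib2.urlopen(url, timeout=timeout)
            return response.read()
        except (socket.timeout, urllib2.URLError) as e:
            print('Attempt %d for %s failed: %s' % (attempt, url, e))
    raise IOError('Giving up on %s after %d attempts' % (url, retries))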

This snippet is taken from one of my projects; it uses a thread pool to download multiple files at once. It uses urllib.urlretrieve, but the logic is the same: url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist on the filesystem.

import Queue
import threading
import urllib
from os.path import exists


def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue of (url, path) work items
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            # grab the next (url, path) pair from the queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url

            # signal to the queue that this job is done
            self.queue.task_done()
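
For reference, a usage sketch (the URLs, destination paths, and thread count below are placeholders, not from the answer):

if __name__ == '__main__':
    # (url, destination path) pairs to download; placeholder values
    jobs = [
        ('http://example.com/images/1.jpg', r'E:\OK\1.jpg'),
        ('http://example.com/images/2.jpg', r'E:\OK\2.jpg'),
    ]
    # four worker threads, skip files that already exist on disk
    download_urls(jobs, num_concurrent=4, skip_existing=True)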
answered by Constantinius