Recently I am working on a tiny crawler for downloading images on a url.
I use openurl() in urllib2 with f.open()/f.write():
Here is the code snippet:
# the list for the images' urls
imglist = re.findall(regImg,pageHtml)
# iterate to download images
for index in xrange(1,len(imglist)+1):
img = urllib2.urlopen(imglist[index-1])
f = open(r'E:\OK\%s.jpg' % str(index), 'wb')
print('To Read...')
# potential timeout, may block for a long time
# so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
f.write(img.read())
f.close()
print('Image %d is ready !' % index)
In the code above, the img.read() will potentially block for a long time, I hope to do some retry/re-open the image url operation under this issue.
I also concern on the efficient perspective of the code above, if the number of the images to be downloaded is somewhat big, using a thread pool to download them seems to be better.
Any suggestions? Thanks in advance.
p.s. I found the read() method on img object may cause blocking, so adding a timeout parameter to the urlopen() alone seems useless. But I found file object has no timeout version of read(). Any suggestions on this ? Thanks very much .
The urllib2.urlopen has a timeout parameter which is used for all blocking operations (connection buildup etc.)
This snippet is taken from one of my projects. I use a thread pool to download multiple files at once. It uses urllib.urlretrieve but the logic is the same. The url_and_path_list is a list of (url, path) tuples, the num_concurrent is the number of threads to be spawned, and the skip_existing skips downloading of files if they already exist in the filesystem.
def download_urls(url_and_path_list, num_concurrent, skip_existing):
# prepare the queue
queue = Queue.Queue()
for url_and_path in url_and_path_list:
queue.put(url_and_path)
# start the requested number of download threads to download the files
threads = []
for _ in range(num_concurrent):
t = DownloadThread(queue, skip_existing)
t.daemon = True
t.start()
queue.join()
class DownloadThread(threading.Thread):
def __init__(self, queue, skip_existing):
super(DownloadThread, self).__init__()
self.queue = queue
self.skip_existing = skip_existing
def run(self):
while True:
#grabs url from queue
url, path = self.queue.get()
if self.skip_existing and exists(path):
# skip if requested
self.queue.task_done()
continue
try:
urllib.urlretrieve(url, path)
except IOError:
print "Error downloading url '%s'." % url
#signals to queue job is done
self.queue.task_done()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With