 

Concurrent downloads - Python

The plan is this:

I download a webpage, parse the DOM to collect a list of image URLs, and then download those images. Afterwards I iterate through the images to evaluate which one is best suited to represent the webpage.

The problem is that the images are downloaded one by one, which can take quite some time.


It would be great if someone could point me in the right direction on this.

Help would be very much appreciated.

asked Mar 02 '10 by RadiantHex



1 Answer

Speeding up crawling is basically Eventlet's main use case. It's extremely fast: we have an application that has to hit 2,000,000 URLs in a few minutes. It uses the fastest event interface on your system (generally epoll) and greenthreads (built on top of coroutines, so very inexpensive) to make the code easy to write.

Here's an example from the docs:

import eventlet
from eventlet.green import urllib2  # non-blocking drop-in for the standard urllib2

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

def fetch(url):
    # Runs inside a greenthread; the green urllib2 yields to other
    # greenthreads while waiting on network I/O.
    body = urllib2.urlopen(url).read()
    return url, body

pool = eventlet.GreenPool()  # caps concurrency (default: 1000 greenthreads)
for url, body in pool.imap(fetch, urls):
    print "got body from", url, "of length", len(body)

This is a pretty good starting point for developing a more fully-featured crawler. Feel free to pop into #eventlet on Freenode to ask for help.

[update: I added a more-complex recursive web crawler example to the docs. I swear it was in the works before this question was asked, but the question did finally inspire me to finish it. :)]
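The same pattern extends naturally to the question's task. Below is a minimal sketch (not from the original answer) that fetches a page, parses out the <img> URLs, and downloads the images concurrently through a GreenPool. It assumes Python 2 with eventlet installed, matching the example above; the ImgCollector class and fetch_page_images helper are illustrative names, not part of eventlet.

import eventlet
from eventlet.green import urllib2  # green (non-blocking) urllib2
from HTMLParser import HTMLParser
from urlparse import urljoin

class ImgCollector(HTMLParser):
    # Collects absolute URLs from <img src="..."> tags as the page is parsed.
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(urljoin(self.base_url, src))

def fetch(url):
    return url, urllib2.urlopen(url).read()

def fetch_page_images(page_url):
    # Download the page itself, then its images concurrently.
    _, html = fetch(page_url)
    parser = ImgCollector(page_url)
    parser.feed(html)
    pool = eventlet.GreenPool(20)  # cap at 20 simultaneous downloads
    return list(pool.imap(fetch, parser.images))

Each (url, body) pair returned by fetch_page_images can then be fed to whatever scoring logic decides which image best represents the page.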

answered Oct 19 '22 by rdw