Python 2.6: parallel parsing with urllib2

I'm currently retrieving and parsing pages from a website using urllib2. However, there are many of them (more than 1000), and processing them sequentially is painfully slow.

I was hoping there was a way to retrieve and parse pages in parallel. Is that a good idea, and if so, how do I do it?

Also, what are "reasonable" values for the number of pages to process in parallel (I wouldn't want to put too much strain on the server or get banned because I'm using too many connections)?

Thanks!

Anthony Labarre asked May 26 '26 03:05



1 Answer

You can always use threads (i.e. run each download in a separate thread). With a large number of pages, one thread per download can consume too many resources, in which case I recommend taking a look at gevent, and specifically this example, which may be just what you need.

(from gevent.org: "gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of libevent event loop")
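For the thread-based option, here is a minimal sketch that uses a small, fixed pool of worker threads rather than one thread per page, so the number of simultaneous connections stays capped. The names `download`, `fetch_all` and `num_workers` are hypothetical, and the default of 4 workers is only a placeholder, not a recommendation for any particular server. The import fallback is there only so the same sketch also runs on Python 3.

```python
import threading

try:
    import Queue as queue            # Python 2 (as in the question)
    from urllib2 import urlopen
except ImportError:
    import queue                     # Python 3 fallback
    from urllib.request import urlopen


def download(url):
    """Return the raw body of url (hypothetical default fetcher)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


def fetch_all(urls, fetch=download, num_workers=4):
    """Fetch every URL using a fixed pool of worker threads.

    Returns a dict mapping each URL to the fetched result, or to the
    exception raised while fetching it.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker pulls URLs until the queue is empty, then exits.
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return
            try:
                outcome = fetch(url)
            except Exception as exc:
                # Keep going; record the failure for this URL.
                outcome = exc
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `pages = fetch_all(urls)` would then give a dict to parse sequentially, or the parsing could be moved into the `fetch` callable so it also happens in the workers. Lowering `num_workers` is the knob for putting less load on the server.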

adamk answered May 27 '26 15:05
