Python 2.6: parallel parsing with urllib2

Question

I'm currently retrieving and parsing pages from a website using urllib2. However, there are many of them (more than 1000), and processing them sequentially is painfully slow.

I was hoping there was a way to retrieve and parse pages in a parallel fashion. If that's a good idea, is it possible, and how do I do it?

Also, what are "reasonable" values for the number of pages to process in parallel (I wouldn't want to put too much strain on the server or get banned because I'm using too many connections)?

Thanks!

adamk · Accepted Answer

You can always use threads (i.e. run each download in a separate thread). With a large number of pages, this can consume too many resources, in which case I recommend taking a look at gevent, and specifically this example, which may be just what you need.

(from gevent.org: "gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of libevent event loop")
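To make the thread approach concrete, here is a minimal sketch of a fixed-size pool of worker threads pulling URLs from a queue, so that only a bounded number of connections is open at once. The names `fetch_page`, `fetch_all` and `num_workers` are made up for illustration, and the default of 4 workers is an arbitrary assumption rather than a known safe value for any particular server. The import fallback is only there so the same code also runs on Python 3, where `urllib2` no longer exists.

```python
import threading

try:  # Python 2, as in the question
    import urllib2
    from Queue import Queue
except ImportError:  # Python 3 fallback so the sketch still runs
    import urllib.request as urllib2
    from queue import Queue


def fetch_page(url):
    """Download one page and return its raw body."""
    response = urllib2.urlopen(url, timeout=30)
    try:
        return response.read()
    finally:
        response.close()


def fetch_all(urls, fetch=fetch_page, num_workers=4):
    """Fetch every URL using a fixed pool of worker threads.

    Returns a dict mapping each URL to a (body, error) pair;
    error is None when the fetch succeeded.
    """
    tasks = Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this thread
                break
            try:
                outcome = (fetch(url), None)
            except Exception as exc:
                # Record the failure instead of killing the worker.
                outcome = (None, exc)
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Calling `fetch_all(list_of_urls, num_workers=4)` returns once every page has been downloaded or has failed; parsing can then run over the returned bodies, or be folded into the `fetch` callable you pass in. Keeping `num_workers` small is one simple way to limit the load on the server.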

Tags:

python

parsing

parallel-processing

urllib2

Anthony Labarre
