How can I speed up fetching pages with urllib2 in python?

I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! :)

UPDATE: I have a function called fetchURLs(), which I use to build an array of the URLs I need, so something like urls = fetchURLs(). The URLs are all XML files from the Amazon and eBay APIs (which confuses me as to why they take so long to load; maybe my web host is slow?)

What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that's what my issue is.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)


Here is the cProfile output:

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)
Asked by Parker on Aug 16 '10


People also ask

What does urllib2 do in Python?

urllib2 is a Python module that can be used for fetching URLs.
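For reference, a minimal fetch with urllib2 looks something like this (Python 2; example.com is just a placeholder URL):

import urllib2

# open the URL and read the whole response body into a string
html = urllib2.urlopen('http://example.com/').read()
print len(html)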

Is requests faster than Urllib?

I found that the time taken to send data from the client to the server was about the same for both modules (urllib and requests), but the time taken to return data from the server to the client was more than twice as fast with urllib as with requests.

What is the difference between Urllib and urllib2?

1) urllib2 can accept a Request object to set the headers for a URL request, whereas urllib accepts only a URL. 2) urllib provides the urlencode method, which is used to generate GET query strings; urllib2 doesn't have such a function. This is one of the reasons why urllib is often used alongside urllib2.
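As a small illustration of that split (Python 2; the query values come from the question's example URL, and the User-Agent header is just a placeholder), urllib builds the query string while urllib2 carries the headers:

import urllib
import urllib2

# urllib builds the GET query string...
params = urllib.urlencode({'DEPT': 'MATH', 'CLASS': '103', 'SEC': '01'})
url = 'http://bluedevilbooks.com/search/?' + params

# ...while urllib2 attaches headers via a Request object
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urllib2.urlopen(req).read()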

Does urllib2 work in Python 3?

NOTE: urllib2 is no longer available in Python 3; its functionality has been folded into the urllib package (urllib.request and urllib.error).
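For anyone on Python 3, the rough equivalent of the urllib2 calls in this question is a one-liner with urllib.request (a sketch, not from the original answers; example.com is a placeholder):

from urllib.request import urlopen

# Python 3 moved urllib2.urlopen to urllib.request.urlopen
html = urlopen('http://example.com/').read()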


2 Answers

EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O, so I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a dup of a question from 3 days ago.

Python urllib2.urlopen() is slow, need a better way to read several urls - Stack Overflow

I'm polishing the code to show how to fetch multiple webpages in parallel using threads.

import time
import threading
import Queue
import urllib2  # needed by fetch() below

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)

As you can see, the application-specific code has only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.
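If it helps, one way to drain the returned Queue afterwards looks like this (a small usage sketch of my own, not part of the original answer; note the Queue holds results in completion order, not in the order of urls):

results = run_parallel_in_threads(fetch, urls)
while not results.empty():
    page = results.get()
    print len(page)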

Unfortunately, most of the other threading code posted here has some flaws. Many of the examples actively poll to wait for the work to finish; join() is a better way to synchronize. I think this code improves upon all the threading examples so far.

keep-alive connection

WoLpH's suggestion about using keep-alive connections could be very useful if all your URLs point to the same server.
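For what it's worth, a minimal sketch of that idea using only the Python 2 standard library would reuse a single httplib connection (the hostname and paths below are placeholders, and this assumes the server actually honors keep-alive; a dedicated keep-alive library would be more robust):

import httplib

# one TCP connection, reused for several requests to the same host
conn = httplib.HTTPConnection('www.example.com')
for path in ['/page1', '/page2', '/page3']:
    conn.request('GET', path)
    resp = conn.getresponse()
    body = resp.read()  # the body must be fully read before the next request
    print path, resp.status, len(body)
conn.close()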

twisted

Aaron Gallagher is a fan of the twisted framework and is hostile to anyone who suggests threads. Unfortunately, a lot of his claims are misinformation. For example, he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to the evidence, as both Nick T and I have demonstrated speed gains from using threads. In fact, I/O-bound applications have the most to gain from using Python's threads (vs. no gain in CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.

Right tool for the right job

I'm well aware of the issues pertaining to parallel programming with threads, Python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flatly say that threads are BAD and twisted is GOOD in all situations.

For example, if the OP's requirement is to fetch 10,000 websites in parallel, async I/O will be preferable. Threading won't be appropriate (unless maybe with stackless Python).
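For the middle ground, say a few hundred URLs, or the OP's 25-process cap, a bounded pool of worker threads is a common compromise. Here is a rough sketch along those lines (my own illustration in Python 2, not code from either answer; fetch_all and num_workers are made-up names):

import threading
import urllib2
import Queue

def fetch_all(urls, num_workers=10):
    """Fetch urls using at most num_workers threads at a time."""
    in_q = Queue.Queue()
    out_q = Queue.Queue()
    for url in urls:
        in_q.put(url)

    def worker():
        while True:
            try:
                url = in_q.get_nowait()
            except Queue.Empty:
                return  # no more work left
            try:
                out_q.put((url, urllib2.urlopen(url).read()))
            except urllib2.URLError as e:
                out_q.put((url, e))  # keep the error so the caller can see it

    threads = [threading.Thread(target=worker) for _ in xrange(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [out_q.get() for _ in xrange(out_q.qsize())]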

Aaron's opposition to threads consists mostly of generalizations. He fails to recognize that this is a trivial parallelization task: each task is independent and does not share resources. So most of his attacks do not apply.

Given that my code has no external dependency, I'll call it the right tool for the right job.

Performance

I think most people would agree that the performance of this task depends largely on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code, there is an obvious flaw that caused the inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside the system code, like the remote server's performance, the network, caching, and the different implementations of urllib2 and the twisted web client, and so on.

Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying that the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.

In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>

In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s

In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead

Further testing of my parallel fetching shows huge variability in the response time across 17 runs. (Unfortunately I don't have twisted installed to verify Aaron's code.)

0.75 s 0.38 s 0.59 s 0.38 s 0.62 s 1.50 s 0.49 s 0.36 s 0.95 s 0.43 s 0.61 s 0.81 s 0.46 s 1.21 s 2.87 s 1.04 s 1.72 s 

My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.

Answered by Wai Yip Tung


Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.

from twisted.internet import defer, reactor
from twisted.web.client import getPage
import time

def processPage(page, url):
    # do something here.
    return url, len(page)

def printResults(result):
    for success, value in result:
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

def printDelta(_, start):
    delta = time.time() - start
    print 'ran in %0.3fs' % (delta,)
    return delta

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
    'http://www.altavista.com/',
    'http://achewood.com/',
]

def fetchURLs():
    callbacks = []
    for url in urls:
        d = getPage(url)
        d.addCallback(processPage, url)
        callbacks.append(d)

    callbacks = defer.DeferredList(callbacks)
    callbacks.addCallback(printResults)
    return callbacks

@defer.inlineCallbacks
def main():
    times = []
    for x in xrange(5):
        d = fetchURLs()
        d.addCallback(printDelta, time.time())
        times.append((yield d))
    print 'avg time: %0.3fs' % (sum(times) / len(times),)

reactor.callWhenRunning(main)
reactor.run()

This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):

Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 29996)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.518s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.461s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30033)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.435s
Success: ('http://www.google.com/', 8117)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.449s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.547s
avg time: 0.482s

And using Nick T's code, rigged up to also give the average of five and show the output better:

Starting threaded reads:
...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
Starting threaded reads:
...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
Starting threaded reads:
...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
Starting threaded reads:
...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
Starting threaded reads:
...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
avg time: 1.775s

Starting sequential reads:
...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
Starting sequential reads:
...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
Starting sequential reads:
...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
Starting sequential reads:
...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
Starting sequential reads:
...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
avg time: 1.439s

And using Wai Yip Tung's code:

Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30051 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.704s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.845s
Fetched 8153 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30070 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.689s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.647s
Fetched 8135 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30349 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.693s
avg time: 0.715s

I've gotta say, I do like that the sequential fetches performed better for me.

Answered by habnabit