As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, which I then parse with BeautifulSoup.
As I have to read 5-10 sites, the page takes a while to load.
I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?
Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.
What is urlopen? urllib2 offers a very simple interface, in the form of the urlopen function. Just pass the URL to urlopen() to get a “file-like” handle to the remote data. It also offers a slightly more complex interface for handling common situations like basic authentication, cookies, proxies and so on.
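A minimal fetch with it looks like this (a sketch; the URL is a placeholder):

import urllib2

# urlopen() returns a file-like object; read() downloads the whole body
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
print('Read %d bytes' % len(html))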
I found that the time taken to send data from the client to the server was about the same for both modules (urllib and requests), but the time taken to return data from the server to the client was more than twice as fast with urllib as with requests.
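If you want to check that against your own servers, a rough timing sketch like the one below works; the URL is a placeholder, and you should take the best of several runs:

import time
import urllib2
import requests

url = 'http://www.example.com/'  # placeholder; substitute the server you are testing against

start = time.time()
urllib2.urlopen(url).read()
print('urllib2:  %.3fs' % (time.time() - start))

start = time.time()
requests.get(url).content
print('requests: %.3fs' % (time.time() - start))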
True, if you want to avoid adding any dependencies, urllib is available. But note that even the official Python documentation recommends the requests library: "The Requests package is recommended for a higher-level HTTP client interface."
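For comparison, the equivalent fetch with requests (assuming it is installed via pip) is just:

import requests

# requests handles redirects, encodings and connection pooling for you
response = requests.get('http://www.example.com/')
print(response.status_code, len(response.text))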
NOTE: urllib2 is no longer available in Python 3; it was split into urllib.request and urllib.error.
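The same call in Python 3 lives in urllib.request:

# Python 3 replacement for urllib2.urlopen
from urllib.request import urlopen

html = urlopen('http://www.example.com/').read()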
I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.
import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch one URL and push the body onto the shared queue
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all downloads to finish
    return result

def fetch_sequential():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.
Also, it is incorrect to say threading is useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because they spend their time blocked on I/O. As you can see from my results, the parallel case is about twice as fast.
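On Python 3, the same pattern can be written with urllib.request and a standard-library thread pool; a minimal sketch:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url):
    # Each call runs in a worker thread; the GIL is released while blocked on I/O
    data = urllib.request.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    return data

def fetch_parallel():
    # map() preserves input order; the with-block joins the workers on exit
    with ThreadPoolExecutor(max_workers=len(urls_to_load)) as pool:
        return list(pool.map(read_url, urls_to_load))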