
Python urllib2.urlopen() is slow, need a better way to read several urls

As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, which I then parse with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once. Or any tricks to make it faster, like: should I close the connection returned by urllib2.urlopen() after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Jack z asked Aug 12 '10

People also ask

What does urllib2's urlopen() do?

urllib2 offers a very simple interface in the form of the urlopen function: just pass a URL to urlopen() to get a "file-like" handle to the remote data. It also handles common situations like basic authentication, cookies, proxies and so on.
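
For illustration, a minimal urllib2 fetch (the URL here is just a placeholder):

import urllib2

response = urllib2.urlopen('http://example.com/')
html = response.read()  # read the whole body as a string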

Is urllib faster than requests?

I found that the time it took to send data from the client to the server was the same for both modules (urllib and requests), but the time it took to return data from the server to the client was more than twice as fast with urllib compared to requests.

Which is better, urllib or requests?

True, if you want to avoid adding any dependencies, urllib is available. But note that even the official Python documentation recommends the requests library: "The Requests package is recommended for a higher-level HTTP client interface."
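
For comparison, the equivalent fetch with requests (assuming it is installed, e.g. via pip; a sketch only, with a placeholder URL):

import requests

r = requests.get('http://example.com/')
html = r.text  # decoded body; use r.content for the raw bytes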

Does urllib2 work in Python 3?

NOTE: urllib2 is no longer available in Python 3.
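
For reference, urllib2's functionality moved into urllib.request in Python 3; a minimal sketch of the equivalent call:

# Python 3: urllib2's urlopen now lives in urllib.request
from urllib.request import urlopen

response = urlopen('http://example.com/')  # placeholder URL
html = response.read()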


1 Answer

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
    'http://stackoverflow.com/',
    'http://slashdot.org/',
    'http://www.archive.org/',
    'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Fetch one URL and push the raw page body onto the shared queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    # Start one thread per URL; each blocks on network I/O independently.
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every page has been fetched
    return result

def fetch_sequential():
    # Baseline: fetch the same URLs one after another.
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result
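
To actually use the fetched pages, drain the queue once fetch_parallel() returns; a usage sketch (not part of the original answer):

result = fetch_parallel()
while not result.empty():
    page = result.get()
    # hand `page` to BeautifulSoup (or other parsing) here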

Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.

Also, it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because they are blocked on I/O. As you can see from my results, the parallel case is about twice as fast.
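
For readers on Python 3, the same I/O-bound threading pattern can be written with concurrent.futures; a sketch under the assumption that urllib.request replaces urllib2 (this is not part of the original answer):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    # Worker threads block on network I/O here, so the GIL is not a bottleneck.
    return urlopen(url).read()

# Reuse the urls_to_load list from the answer above.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, body in zip(urls_to_load, pool.map(fetch, urls_to_load)):
        print('Fetched %d bytes from %s' % (len(body), url))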

Wai Yip Tung answered Sep 23 '22