
Why socket implementation is slower than requests?

I have a Python 3.4 script that fetches multiple web pages. At first, I used the requests library to fetch pages:

def get_page_requests(url):
    r = requests.get(url)
    return r.content

The above code gives an average speed of 4.6 requests per second. To increase speed, I rewrote the function to use the socket library:

def get_page_socket(url):

    url = urlparse(url)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((url.netloc, 80))
    req = ('GET {} HTTP/1.1\r\n'
           'Host: {}\r\n'
           'Connection: Keep-Alive\r\n'
           '\r\n').format(url.path or '/', url.netloc)
    sock.sendall(req.encode())
    reply = b''
    while True:
        chunk = sock.recv(65535)
        if chunk:
            reply += chunk
        else:
            break
    sock.close()
    return reply

And the average speed fell to 4.04 requests per second. I was not hoping for a dramatic speed boost, but I was hoping for a slight increase, since sockets are more low-level. Is this a library issue, or am I doing something wrong?

eyeinthebrick asked Sep 06 '14
1 Answer

requests uses urllib3, which handles HTTP connections very efficiently. Connections to the same server are re-used wherever possible, saving you the socket connection and teardown costs:

  • Re-use the same socket connection for multiple requests, with optional client-side certificate verification. See: HTTPConnectionPool and HTTPSConnectionPool
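Connection reuse can be seen with the standard library alone. The sketch below is an illustration, not urllib3's actual code: it starts a hypothetical local test server, then sends two requests over a single `http.client.HTTPConnection`, so only one TCP handshake is paid for both fetches.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'  # HTTP/1.1 keeps the connection open by default

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One HTTPConnection object keeps the underlying socket open,
# so both requests travel over the same TCP connection.
conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
for _ in range(2):
    conn.request('GET', '/')
    resp = conn.getresponse()
    data = resp.read()  # must drain the body before reusing the connection
conn.close()
server.shutdown()
```

urllib3's connection pools do essentially this bookkeeping for you, across many hosts at once.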

In addition, urllib3 and requests advertise to the server that they can handle compressed responses; with compression you can transfer more data in the same amount of time, leading to more requests per second.

  • Supports gzip and deflate decoding. See: decode_gzip() and decode_deflate()

urllib3 uses sockets too (albeit via the http.client module); there is little point in reinventing this wheel. Perhaps you should think about fetching URLs in parallel instead, using threading, multiprocessing, or eventlet; the requests author maintains a gevent-based requests integration package that can help there. Another way of achieving concurrency would be to use asyncio combined with aiohttp, as HTTP requests are mostly waiting for network I/O.
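A minimal sketch of the threading approach, using `concurrent.futures`. The `fetch` function here is a hypothetical stand-in (a `time.sleep` simulating network latency) rather than a real HTTP call, to show the point: because fetches are I/O-bound, eight of them overlap in a thread pool and the batch finishes in roughly the time of one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real requests.get(url); the 0.1 s sleep simulates
    # network latency, which dominates HTTP fetch time.
    time.sleep(0.1)
    return url, 200

urls = ['http://example.invalid/page{}'.format(i) for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.monotonic() - start

# Eight 0.1 s "fetches" overlap, so the whole batch takes about 0.1 s, not 0.8 s.
print(elapsed)
```

Swapping the sleep for a real `requests.get` inside a shared `Session` combines both wins: concurrency across hosts and connection reuse per host.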

Martijn Pieters answered Oct 31 '22