Requests with multiple connections

Question

I use the Python Requests library to download a big file, e.g.:

r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content

The big file downloads at roughly 30 Kb per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.

Is there a way to make multiple connections at the same time to download one file?

Vyktor · Accepted Answer

You can use the HTTP Range header to fetch just part of a file (already covered for Python here).

Just start several threads, fetch a different range with each, and you're done ;)

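import threading
import urllib2

url = 'http://bigfile.com/bigfile.bin'  # URL from the question
chunk_size = 1 << 20  # bytes per request (example value, pick your own)
parts = {}  # maps start offset -> bytes downloaded for that range

def download(url, start):
    req = urllib2.Request(url)
    # Range bounds are inclusive on both ends, hence the -1
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []

# Initialize threads (10 threads cover the first 10 * chunk_size bytes)
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)

# Join threads back (order doesn't matter, you just want them all)
for t in threads:
    t.join()

# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))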

Also note that not every server supports the Range header (servers where a PHP script is responsible for serving the data, in particular, often don't implement it).
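One way to find out whether a server honours Range is to send a small probe request and look at the reply. The sketch below is written for Python 3; the function name `supports_byte_ranges` is made up for illustration. It relies only on two standard HTTP signals: a `206 Partial Content` status means the range was honoured, and an `Accept-Ranges: bytes` header advertises support (a hint, not a guarantee).

```python
def supports_byte_ranges(status_code, headers):
    """Guess Range support from the status and headers of a probe response.

    The probe is assumed to be a HEAD request or a GET carrying
    'Range: bytes=0-0'; `headers` is any mapping of header names to values.
    """
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status_code == 206:
        # 206 Partial Content: the server honoured the requested range.
        return True
    # Otherwise fall back to what the server advertises.
    return lowered.get("accept-ranges", "").strip().lower() == "bytes"
```

With Requests, the probe could look like `r = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True)` followed by `supports_byte_ranges(r.status_code, r.headers)`. A server that ignores the header normally answers `200` with the whole body.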

jfs · Answer

Here's a Python script that saves a given URL to a file and uses multiple threads to download it:

#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b''  if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4) # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()

The end of the file is detected if the server returns an empty body or a 416 HTTP status code, or if the response size is not exactly chunksize.

It supports servers that don't understand the Range header (in that case everything is downloaded in a single request; to support large files, change download_chunk() to save to a temporary file and return the filename, to be read in the main thread, instead of the file content itself).
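The temporary-file variant mentioned above could be sketched roughly as follows. This is written for Python 3 rather than for the script above, and the helper names `save_to_tempfile` and `append_and_remove` are hypothetical. The worker streams the response into a temporary file instead of calling `read()`, and the main thread copies that file into the destination.

```python
import os
import shutil
import tempfile


def save_to_tempfile(response, directory=None):
    """Stream a file-like response into a temporary file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".part", dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        # copyfileobj copies block by block, so the body never sits in memory whole.
        shutil.copyfileobj(response, tmp)
    return path


def append_and_remove(path, out):
    """Append the chunk file at `path` to the open file `out`, then delete it.

    Returns the chunk size in bytes so the caller can still detect a short chunk.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as part:
        shutil.copyfileobj(part, out)
    os.remove(path)
    return size
```

In the worker, `save_to_tempfile(urlopen(req))` would stand in for `urlopen(req).read()`, and the main loop would compare the size returned by `append_and_remove()` with chunksize instead of using `len(s)`.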

It lets you change the number of concurrent connections (pool size) and the number of bytes requested in a single HTTP request independently.
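Since the question asks about Requests, here is a minimal Python 3 sketch of the same idea with the two knobs kept separate. It is not the script above: `split_ranges`, `download_parallel` and `make_requests_fetcher` are illustrative names, the total file size is assumed to be known in advance, and the whole result is held in memory.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, chunk_size):
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]


def download_parallel(fetch_range, total_size, chunk_size=1 << 16, workers=4):
    """Fetch every range with a thread pool and join the parts in order."""
    ranges = split_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so no sorting is needed.
        return b"".join(pool.map(lambda r: fetch_range(*r), ranges))


def make_requests_fetcher(url, timeout=30):
    """Build a fetch_range(start, end) callable backed by Requests."""
    import requests  # third-party; imported here so the helpers above run without it

    def fetch_range(start, end):
        headers = {"Range": "bytes=%d-%d" % (start, end)}
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()
        if r.status_code != 206:
            # A 200 here means the server sent the full body instead of a part.
            raise RuntimeError("server ignored the Range header")
        return r.content

    return fetch_range
```

A call might look like `download_parallel(make_requests_fetcher(url), total_size, chunk_size=1 << 20, workers=8)`. The size could come from the `Content-Length` header of a `requests.head(url)` response, assuming the server reports it.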

To use multiple processes instead of threads, change the import:

from multiprocessing.pool import Pool # use processes (other code unchanged)
