How to parallelize file downloads?

I can download one file at a time with:

import urllib.request

urls = ['http://foo.com/bar.gz', 'http://foobar.com/barfoo.gz', 'http://bar.com/foo.gz']

for u in urls:
    urllib.request.urlretrieve(u)  # downloads each file sequentially

I could try to parallelize it with subprocess like this:

import subprocess
import os

def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()  # block until any child process exits (Unix-only)
            processes.difference_update(
                [p for p in processes if p.poll() is not None])

    # Wait for the remaining child processes to finish
    for p in processes:
        if p.poll() is None:
            p.wait()

urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz', 
'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']

parallelized_commandline('wget', urls)

Is there any way to parallelize urlretrieve without using os.system or subprocess to cheat?

Given that I must resort to the "cheat" for now, is subprocess.Popen the right way to download the data?

When using parallelized_commandline() above, it seems to use multiple threads but not multiple cores for wget. Is that normal? Is there a way to make it multi-core instead of multi-threaded?

asked Aug 03 '15 by alvas

1 Answer

You could use a thread pool to download files in parallel:

#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve

urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
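
If you want each file saved under its own name rather than a temporary path, a small wrapper around urlretrieve works too (a minimal sketch; the fetch() helper and its naming scheme are my assumption, not part of the original answer):

#!/usr/bin/env python3
import os.path
from multiprocessing.dummy import Pool  # thread pool
from urllib.parse import urlsplit
from urllib.request import urlretrieve

def fetch(url):
    # save under the last path component, e.g. 'news-commentary-v10.en.gz'
    filename = os.path.basename(urlsplit(url).path)
    return urlretrieve(url, filename)

urls = [...]
with Pool(4) as pool:  # download 4 files at a time
    results = pool.map(fetch, urls)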

You could also download several files at once in a single thread using asyncio (the example below uses the pre-Python 3.5 generator-based coroutine syntax):

#!/usr/bin/env python3
import asyncio
import logging
from contextlib import closing
import aiohttp # $ pip install aiohttp

@asyncio.coroutine
def download(url, session, semaphore, chunk_size=1<<15):
    with (yield from semaphore): # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        response = yield from session.get(url)
        with closing(response), open(filename, 'wb') as file:
            while True: # save file
                chunk = yield from response.content.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)
        logging.info('done %s', filename)
    return filename, (response.status, tuple(response.headers.items()))

urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
with closing(asyncio.get_event_loop()) as loop, \
     closing(aiohttp.ClientSession()) as session:
    semaphore = asyncio.Semaphore(4)
    download_tasks = (download(url, session, semaphore) for url in urls)
    result = loop.run_until_complete(asyncio.gather(*download_tasks))

where url2filename() is a small helper, linked in the original answer, that derives a local filename from the URL.
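
A minimal url2filename() sketch, assuming it simply takes the last component of the URL path (my reconstruction, not the linked code):

import os.path
from urllib.parse import urlsplit

def url2filename(url):
    # e.g. 'http://example.com/dir/foo.gz' -> 'foo.gz'
    return os.path.basename(urlsplit(url).path)

On Python 3.7+ the same idea can be written with async/await instead of the generator-based coroutines above. A rough sketch, assuming aiohttp's current ClientSession API and the url2filename() helper (my adaptation, not part of the original answer):

#!/usr/bin/env python3
import asyncio
import aiohttp  # $ pip install aiohttp

async def download(url, session, semaphore, chunk_size=1 << 15):
    async with semaphore:  # limit the number of concurrent downloads
        filename = url2filename(url)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:
                async for chunk in response.content.iter_chunked(chunk_size):
                    file.write(chunk)  # stream the body to disk chunk by chunk
            return filename, response.status

async def main(urls):
    semaphore = asyncio.Semaphore(4)  # at most 4 downloads in flight
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(download(url, session, semaphore) for url in urls))

urls = [...]
results = asyncio.run(main(urls))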

answered Oct 07 '22 by jfs