Python download multiple files from links on pages

I'm trying to download all the PGNs from this site.

I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN via the download button near the bottom of each game page. Do I have to create a new BeautifulSoup object for each game? I'm also unsure how urlretrieve works.

import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))
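
From the docs, urlretrieve seems to just take a URL and a local filename, roughly like below (the PGN URL is only a placeholder, since I don't know the real download paths yet), but I don't know how to tie it to the download buttons:

urlretrieve('http://www.chessgames.com/pgn/some_game.pgn', 'some_game.pgn')  # fetch the URL and save it to a local file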
Monty asked Sep 17 '17

2 Answers

There is no short answer to your question, so I'll show a complete solution and comment on each step.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly advise using the lxml parser rather than the built-in html.parser (lxml is a third-party package, installable with pip install lxml). After that, prepare the list of game links:

pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))
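
As a quick sanity check, you can print a few of the matched links (the exact href format in the comment is an assumption about the site, not a guarantee):

for page in pages[:3]:
    print(page.get('href'))  # site-relative paths, e.g. something like '/perl/chessgame?gid=...'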

This works because every game link contains the word 'chessgame' in its href. Next, prepare a function that will download the files for you:

def download_file(url):
    # Derive a local filename from the last path segment, dropping any query string
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        # Write the response to disk in chunks instead of loading it all into memory
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
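
For example, calling it with a direct PGN link (the URL below is just a placeholder to illustrate the call) saves the file into the current working directory:

download_file('http://www.chessgames.com/pgn/some_game.pgn')  # hypothetical URL; would be saved as 'some_game.pgn'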

Finally, repeat the previous steps for every game page, extracting each download link and handing it to the downloader:

host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')  # full URL of the individual game page
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    file_link = soup.find('a', text=re.compile('.*download.*'))  # the "download" anchor on the game page
    file_url = host + file_link.get('href')
    download_file(file_url)

(First search for the anchor whose text contains the word 'download', then build the full URL by concatenating the hostname and the path, and finally download the file.)
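
If you want to be a little more defensive, a variation of that loop (a sketch; the one-second pause is an arbitrary choice) skips pages without a download link and paces the requests:

import time

for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    file_link = soup.find('a', text=re.compile('.*download.*'))
    if file_link is not None:  # some pages may not expose a download link
        download_file(host + file_link.get('href'))
    time.sleep(1)  # pause between requests to go easy on the server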

I hope you can use this code without any changes!

Roman Mindlin answered Sep 25 '22

The accepted answer is fantastic but the task is embarrassingly parallel; there's no need to retrieve these sub-pages and files one at a time. This answer shows how to speed things up.

The first step is to use requests.Session() when sending multiple requests to a single host. Quoting Advanced Usage: Session Objects from the requests docs:

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
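
As a minimal sketch of that idea (reusing the collection URL from the question), every request made through the same Session can reuse a pooled TCP connection to the host:

import requests

with requests.Session() as session:
    # Requests sent through this session share connection pooling and cookies
    index = session.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
    index.raise_for_status()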

Next, asyncio, multiprocessing, or multithreading are available to parallelize the workload. Each has trade-offs for the task at hand, and the best choice is probably determined by benchmarking and profiling. This page offers great examples of all three.

For the purposes of this post, I'll show multithreading. The GIL shouldn't be much of a bottleneck because the tasks are mostly IO-bound: most of the time is spent waiting on in-flight requests. When a thread is blocked on IO, it can yield to a thread parsing HTML or doing other CPU-bound work.

Here's the code:

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, host, page, destination_path = task
    response = session.get(host + page)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)

    with requests.Session() as session:
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host, page.get("href"), destination_path) 
            for page in pages
        ]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)

I used response.iter_content here, which is unnecessary for such tiny text files, but it generalizes the code so that larger files are handled in a memory-friendly way.
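
For comparison, if the files are guaranteed to stay small, the streaming loop could be replaced with a single write; a sketch using the same names as the function above:

response = session.get(game_url)  # no stream=True: the whole body is read into memory
response.raise_for_status()
with open(path, "wb") as f:
    f.write(response.content)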

Results from a rough benchmark (the first request is a bottleneck):

max workers    session?    seconds
1              no          126
1              yes         111
8              no          24
8              yes         22
32             yes         16
ggorlen answered Sep 25 '22