
Download .xls files from a webpage using Python and BeautifulSoup

I want to download all the .xls, .xlsx, and .csv files from this website into a specified folder.

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009

I have looked into mechanize, Beautiful Soup, urllib2, etc. Mechanize does not work in Python 3, and urllib2 also had problems with Python 3. I looked for a workaround but could not find one, so I am currently trying to make it work using Beautiful Soup.

I found some example code and attempted to modify it to suit my problem, as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')

However, when I run this code, it neither extracts the files from the target page nor outputs any failure message (e.g. 'failed to download').

  • How can I use BeautifulSoup to select the Excel files from the page?
  • How can I download these files to a local file using Python?
Anubhav Dikshit asked Jan 06 '16 12:01


2 Answers

The issues with your script as it stands are:

  1. The URL has a trailing /, which gives an invalid page when requested, one that does not list the files you want to download.
  2. The CSS selector in soup.select(...) selects div elements with the attribute webpartid, which does not exist anywhere in the linked document. The selector therefore matches nothing and the loop body never runs, which is why you see neither downloads nor failure messages.
  3. You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and do not need quoting.
  4. The try:...except: block stops you from seeing the errors raised when trying to download a file. Using an except block without a specific exception is bad practice and should be avoided.
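To illustrate point 3, here is a small sketch using only the standard library. The href below is a made-up absolute link on example.com standing in for one found on the page, not a real file on the site.

```python
from urllib.parse import quote, urljoin

PAGE_URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
# Hypothetical absolute link, standing in for an href found on the page
href = 'http://example.com/files/report.xls'

# urljoin returns an absolute URL unchanged, so joining is harmless but unnecessary
joined = urljoin(PAGE_URL, href)

# quote() escapes the ':' after the scheme, so the result is no longer an absolute URL
quoted = quote(href)

# Joining the quoted string treats it as a relative path under the page URL
mangled = urljoin(PAGE_URL, quoted)

print(joined)
print(quoted)
print(mangled)
```

The last value points at a path under www.rbi.org.in/Scripts/ rather than at the original host, so a request for it cannot reach the intended file.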

A modified version of your code that will get the correct files and attempt to download them is as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser. At first I thought this was a referrer check (to prevent hotlinking). However, if you watch the request in your browser (e.g. with Chrome Developer Tools) you'll notice that the initial http:// request is blocked there too, and Chrome then attempts an https:// request for the same file.

In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you need to rewrite http: to https: before using the URL for the request. The following code modifies the URLs accordingly and downloads the files. I've also added a variable to specify the output folder, which is combined with the filename using os.path.join:

import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
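If you want the script to keep going when a single file fails, and to create the output folder when it is missing, the download step could be wrapped roughly as below. This is only a sketch: force_https, download and the retrieve parameter are names I made up for illustration, and retrieve exists solely so the function can be tried without network access.

```python
import os
from urllib.error import URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlretrieve


def force_https(href):
    """Return href with an http scheme swapped for https; other URLs are unchanged."""
    parts = urlsplit(href)
    if parts.scheme == 'http':
        parts = parts._replace(scheme='https')
    return urlunsplit(parts)


def download(href, output_dir, retrieve=urlretrieve):
    """Download href into output_dir; return the local path, or None on failure.

    `retrieve` is a hypothetical hook so the function can be exercised offline.
    """
    # Create the target folder if needed ('' means the current folder)
    os.makedirs(output_dir or '.', exist_ok=True)
    filename = os.path.join(output_dir, href.rsplit('/', 1)[-1])
    try:
        retrieve(force_https(href), filename)
    except URLError as exc:
        # HTTPError is a subclass of URLError, so a 403 is reported here too
        print('Failed to download %s: %s' % (href, exc))
        return None
    return filename
```

Unlike a plain string replace, force_https only touches the scheme, and catching URLError rather than using a bare except keeps programming errors visible.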
mfitzp answered Oct 25 '22 15:10


I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:

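import requests
import wget

from bs4 import BeautifulSoup, SoupStrainer
from requests.compat import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'

file_types = ('.xls', '.xlsx', '.csv')

# Fetch the page once rather than once per file type
response = requests.get(url)

# Parse only the <a> tags, then keep those that carry an href attribute
soup = BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    if link['href'].lower().endswith(file_types):
        # urljoin leaves absolute links untouched and resolves relative ones against the page URL
        full_path = urljoin(url, link['href'])
        wget.download(full_path)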
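Matching extensions against the raw href string can misfire when a link carries a query string or an upper-case extension. Here is a minimal sketch that compares only the path component of the URL. It is written for Python 3, and has_wanted_extension and WANTED_EXTENSIONS are names invented for this example.

```python
import posixpath
from urllib.parse import urlsplit

WANTED_EXTENSIONS = {'.csv', '.xls', '.xlsx'}


def has_wanted_extension(href):
    """True if the path component of href ends in one of the wanted extensions."""
    # urlsplit separates the path from any query string or fragment
    path = urlsplit(href).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in WANTED_EXTENSIONS
```

A check like has_wanted_extension(link['href']) could then replace the string test inside the loop.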
Blairg23 answered Oct 25 '22 13:10