I want to download all the .xls, .xlsx, and .csv files from this website into a specified folder:
https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009
I have looked into mechanize, Beautiful Soup, urllib2, etc. Mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for workarounds but couldn't find one. So I am currently trying to make it work using Beautiful Soup.
I found some example code and attempted to modify it to suit my problem, as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')
However, when run, this code does not extract the files from the target page, nor does it output any failure message (e.g. 'failed to download').
The issues with your script as it stands are:

- url has a trailing /, which gives an invalid page when requested, not listing the files you want to download.
- soup.select(...) is selecting div elements with the attribute webpartid, which does not exist anywhere in that linked document.
- The try:...except: block is stopping you from seeing the errors generated when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided.

A modified version of your code that will get the correct files and attempt to download them is as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
However, if you run the modified script you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser.
At first I thought this was a referral check (to prevent hotlinking), but if you watch the request in your browser (e.g. Chrome Developer Tools) you'll notice that the initial http:// request is blocked there as well, and Chrome then retries the same file over https://.
In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is joined to the filename using os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need an https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:
import requests
import wget
from urlparse import urljoin  # urllib.parse in Python 3
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ['.xls', '.xlsx', '.csv']

# Fetch the page once rather than once per extension
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if any(file_type in link['href'] for file_type in file_types):
            # urljoin handles absolute and relative hrefs correctly;
            # plain url + link['href'] concatenation breaks absolute links
            full_path = urljoin(url, link['href'])
            wget.download(full_path)
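As written, wget.download saves into the current working directory. If you need the specified folder from the question, the wget package's download function takes an out argument that, to my knowledge, can point at an existing directory; a sketch:

import os

output_dir = 'rbi_files'  # hypothetical output folder
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)
wget.download(full_path, out=output_dir)  # save the file into output_dir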