I’m trying to download a CSV file from this site:
http://www.nasdaq.com/screening/companies-by-name.aspx
If I enter this URL in my Chrome browser, the CSV file download starts immediately, and I get a file with data on a few thousand companies. However, if I use the code below I get an access denied error. There is no login on this page, so what is the Python code doing differently?
from urllib import urlopen
response = urlopen('http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download')
csv = response.read()
# Save the string to a file
csvstr = str(csv).strip("b'")
lines = csvstr.split("\\n")
f = open(r"C:\Users\Ankit\historical.csv", "w")  # raw string so the backslashes aren't treated as escapes
for line in lines:
    f.write(line + "\n")
f.close()
The easy way to resolve the error is to pass a valid user-agent as a header parameter, as shown below. You can also set a timeout in case the website doesn't respond; Python raises a socket exception if no response arrives within the given timeout period.
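For the timeout part, a minimal sketch (urlopen accepts a timeout keyword in both urllib2 and Python 3's urllib.request; the 10-second value here is just an illustration):

import socket
import urllib2

try:
    # Give the server at most 10 seconds to respond
    response = urllib2.urlopen('http://www.nasdaq.com/screening/companies-by-name.aspx', timeout=10)
except socket.timeout:
    print 'No response within 10 seconds'
except urllib2.URLError as e:
    # Connection-level timeouts can also surface as URLError with a socket.timeout reason
    print 'Failed to reach the server:', e.reason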
URLError – raised for malformed URLs, or when the URL cannot be fetched because of connectivity problems; it has a 'reason' attribute that tells you why the request failed. HTTPError – raised for HTTP-specific errors, such as authentication failures; it is a subclass of URLError.
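Here is a sketch of how those two exceptions are typically told apart (the same names live under urllib.error in Python 3):

import urllib2

try:
    response = urllib2.urlopen('http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download')
except urllib2.HTTPError as e:
    # HTTP-level failure: the server answered, but with an error status
    print 'The server returned HTTP error', e.code
except urllib2.URLError as e:
    # Everything else: DNS failure, refused connection, and so on
    print 'Failed to reach the server:', e.reason
else:
    print response.read()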
True, if you want to avoid adding any dependencies, urllib is available. But note that even the official Python documentation recommends the requests library: "The Requests package is recommended for a higher-level HTTP client interface."
urllib2 (urllib.request in Python 3) is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols.
The default User-Agent header sent by urllib2 (and likewise urllib) is "Python-urllib/2.7" (replace 2.7 with your version of Python).
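If you want to confirm what your interpreter sends by default, one way (a sketch, just inspecting the opener's default header list) is:

import urllib2

opener = urllib2.build_opener()
# The default headers include the Python user agent, e.g. [('User-agent', 'Python-urllib/2.7')]
print opener.addheaders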
You're getting a 403 error because the NASDAQ server doesn't seem to want to send content to this user agent. You can “spoof” the user agent header, and then it downloads successfully. Here’s a minimal example:
import urllib2

DOWNLOAD_URL = 'http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download'

# Pretend to be a regular browser so the server doesn't reject the request
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

req = urllib2.Request(DOWNLOAD_URL, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Print the error body the server sent along with the status code
    print e.fp.read()
else:
    content = page.read()
    print content
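If you're on Python 3, urllib2 was merged into urllib.request, so a rough equivalent of the snippet above would be:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

DOWNLOAD_URL = 'http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

req = Request(DOWNLOAD_URL, headers=hdr)
try:
    page = urlopen(req)
except HTTPError as e:
    print(e.read())
else:
    print(page.read())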
Or you can use python-requests:
import requests
url = 'http://www.nasdaq.com/screening/companies-by-name.aspx'
params = {'render': 'download'}  # requests builds the ?render=download query string from this dict
resp = requests.get(url, params=params)
print resp.text
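To actually end up with a CSV file on disk (what the question is after), you can write the response body out directly. This is just a sketch: it reuses the browser User-Agent header in case the default python-requests agent is blocked as well, and the Windows path is simply the one from the question.

import requests

url = 'http://www.nasdaq.com/screening/companies-by-name.aspx'
params = {'render': 'download'}
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

resp = requests.get(url, params=params, headers=hdr)
resp.raise_for_status()  # fail loudly on a 403 instead of silently saving an error page

# Write raw bytes so newlines and encoding are preserved exactly as sent
with open(r'C:\Users\Ankit\historical.csv', 'wb') as f:
    f.write(resp.content)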