
Python urllib getting access denied when browser works

I’m trying to download a CSV file from this site:

http://www.nasdaq.com/screening/companies-by-name.aspx

If I enter this URL in my Chrome browser, the CSV download starts immediately and I get a file with data on a few thousand companies. However, if I use the code below I get an "access denied" error. There is no login on this page, so what is the Python code doing differently from the browser?

from urllib import urlopen

response = urlopen('http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download')
csv = response.read()

# Save the string to a file
csvstr = str(csv).strip("b'")

lines = csvstr.split("\\n")
f = open("C:\Users\Ankit\historical.csv", "w")
for line in lines:
    f.write(line + "\n")
f.close()
asked Jul 25 '14 by user3878070

People also ask

How do I fix 403 Forbidden in Python?

The easy way to resolve the error is to pass a valid user-agent as a header parameter. You can also set a timeout in case you are not getting a response from the website: Python raises a socket exception if the website doesn't respond within the given timeout period.
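Following that advice with Python 3's urllib.request might look like the sketch below; the User-Agent string and the timeout value are arbitrary examples, and the actual network call is left commented out:

```python
import urllib.request

URL = 'http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download'

# An example browser-style User-Agent string; any realistic value may do.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}
req = urllib.request.Request(URL, headers=headers)

# A timeout (in seconds) makes urlopen raise an exception instead of
# hanging forever if the server never answers:
# response = urllib.request.urlopen(req, timeout=10)
```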

What is Urllib error in Python?

URLError – It is raised for errors in URLs, or for connectivity errors while fetching a URL, and has a 'reason' attribute that tells the user the cause of the error. HTTPError – It is raised for HTTP protocol errors, such as authentication request errors. It is a subclass of URLError.
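A small sketch of that error hierarchy in Python 3 (the `data:` URL is just a stand-in so the example runs without any network access):

```python
import urllib.error
import urllib.request

def fetch(url):
    # HTTPError is a subclass of URLError, so it must be caught first.
    try:
        return urllib.request.urlopen(url, timeout=10).read()
    except urllib.error.HTTPError as e:
        print('HTTP error, status:', e.code)      # e.g. 403
    except urllib.error.URLError as e:
        print('connection problem:', e.reason)

# data: URLs are handled by urllib's built-in DataHandler, no network needed
print(fetch('data:,ok'))                                          # b'ok'
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
```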

Which is better Urllib or requests?

True, if you want to avoid adding any dependencies, urllib is available. But note that even the official Python documentation recommends the requests library: "The Requests package is recommended for a higher-level HTTP client interface."

What does Urllib request do in Python?

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols.
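The "file-like response" interface can be shown without touching the network at all, since urlopen also accepts `data:` URLs out of the box:

```python
import urllib.request

# urlopen() returns a file-like response object regardless of the scheme;
# a data: URL keeps this sketch self-contained (no network needed).
response = urllib.request.urlopen('data:,Hello%2C%20World')
print(response.read())    # b'Hello, World'
```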


2 Answers

The default user-agent header sent by urllib2 (and, similarly, urllib) is "Python-urllib/2.7" (replace 2.7 with your version of Python).

You're getting a 403 error because the NASDAQ server doesn't want to send content to this user agent. If you "spoof" the user-agent header to look like a browser, the download succeeds. Here's a minimal example:

import urllib2

DOWNLOAD_URL = 'http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download'

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
req = urllib2.Request(DOWNLOAD_URL, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    # The body of the error response often explains why the request was rejected
    print e.fp.read()
else:
    content = page.read()
    print content
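For reference, a Python 3 sketch of the same approach (urllib2 was split into urllib.request and urllib.error; catching URLError also covers HTTPError, its subclass, as well as plain connection failures):

```python
import urllib.error
import urllib.request

DOWNLOAD_URL = 'http://www.nasdaq.com/screening/companies-by-name.aspx?&render=download'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
       '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

req = urllib.request.Request(DOWNLOAD_URL, headers=hdr)
try:
    page = urllib.request.urlopen(req, timeout=10)
except urllib.error.URLError as e:   # HTTPError is a subclass of URLError
    print('request failed:', e)
else:
    print(page.read())               # the CSV payload, as bytes
```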
answered Sep 18 '22 by alexwlchan


Or you can use the python-requests library:

import requests

url = 'http://www.nasdaq.com/screening/companies-by-name.aspx'
params = {'render': 'download'}

resp = requests.get(url, params=params)
print resp.text
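Note that requests also sends its own default user agent (`python-requests/x.y`), which a server could reject just like urllib's. If that happens, the same header trick applies; a sketch, where the User-Agent string and timeout are example values:

```python
import requests

url = 'http://www.nasdaq.com/screening/companies-by-name.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}  # example value

try:
    resp = requests.get(url, params={'render': 'download'},
                        headers=headers, timeout=10)
    resp.raise_for_status()          # raises requests.HTTPError on 4xx/5xx
    print(resp.text)
except requests.RequestException as e:
    print('request failed:', e)
```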
answered Sep 19 '22 by Gaurav Jain