The easiest way to resolve the error is to pass a valid user-agent as a header parameter, as shown below. You can also set a timeout in case you are not getting a response from the website; Python will raise a socket exception if the website doesn't respond within the given timeout period.
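A minimal sketch of both ideas together (the URL here is a placeholder):
from urllib.request import Request, urlopen

url = 'https://example.com'  # placeholder URL -- substitute the site you are requesting

# Supply a browser-like User-Agent and a timeout (in seconds)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()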
The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it. This status is similar to 401, but for the 403 Forbidden status code, re-authenticating makes no difference.
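For completeness, a hedged sketch of how you might detect that status in Python: urllib raises HTTPError for 4xx responses, and the error object carries the status code (the URL below is a placeholder):
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    urlopen('https://example.com/protected')  # placeholder URL
except HTTPError as e:
    if e.code == 403:
        print('403 Forbidden: re-authenticating will not help')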
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like Python-urllib/3.3, so it's easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen

# A browser-like User-Agent keeps the request from being rejected as a bot
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think it's just a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...
It's definitely blocking you because of urllib's user agent. The same thing happened to me with OfferUp. You can create a new class called AppURLopener which overrides the user-agent with Mozilla.
import urllib.request

# Note: FancyURLopener is deprecated since Python 3.3, but it still works here
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"  # sent as the User-Agent header

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
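To actually read the body of the response opened above:
print(response.read().decode('utf-8'))  # for httpbin, prints something like {"user-agent": "Mozilla/5.0"}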
"This is probably because of mod_security or some similar server security feature which blocks known
spider/bot
user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo
from urllib.request import Request, urlopen

url = "https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()      # raw bytes from the server
webpage = web_byte.decode('utf-8')  # decode bytes to str
web_byte is a bytes object returned by the server, and the content type of the page is usually UTF-8, so you need to decode web_byte using the decode method.
This solved the problem I was having while trying to scrape a website from PyCharm.
P.S.: I use Python 3.4.
Based on the previous answers, this worked for me with Python 3.7 after increasing the timeout to 10.
from urllib.request import Request, urlopen

# 'Url_Link' is a placeholder; any identifying User-Agent string works here
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()  # give up after 10 seconds
print(webpage)
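If the timeout elapses, Python raises an exception rather than hanging; a minimal sketch of guarding against that (the URL is still a placeholder, and the timeout may surface either as socket.timeout or wrapped in URLError):
import socket
from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
try:
    webpage = urlopen(req, timeout=10).read()
except (socket.timeout, URLError) as e:
    print('request failed or timed out:', e)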
Since the page works in a browser but not when called from within a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.
Demonstration:
curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'
...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>
and the content in r.txt has status line:
HTTP/1.1 403 Forbidden
Try sending a 'User-Agent' header that fakes a web client.
NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page, or simply use a browser debugger (like Firebug's Net tab) to see which URL you need to call to get the table's content.
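Once you have found that URL in the debugger, a hedged sketch of calling it directly (the endpoint below is hypothetical; substitute whatever the network tab shows):
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint discovered via the browser's network tab
api_url = 'http://www.cmegroup.com/example/table-data.json'
req = Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urlopen(req).read().decode('utf-8'))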