I'm making a Python script for personal use, but it's not working for Wikipedia...
This works:
import urllib2, sys
from bs4 import BeautifulSoup
site = "http://youtube.com"
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
print soup
This doesn't work:
import urllib2, sys
from bs4 import BeautifulSoup
site= "http://en.wikipedia.org/wiki/StackOverflow"
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
print soup
This is the error:
Traceback (most recent call last):
  File "C:\Python27\wiki.py", line 5, in <module>
    page = urllib2.urlopen(site)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 406, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
In a browser you can try to resolve a 403 error by refreshing the page, rechecking the URL, clearing cookies, or checking your credentials. The HTTP 403 Forbidden status code indicates that the server understood the request but refuses to authorize it. It is similar to 401 Unauthorized, except that with 403 Forbidden re-authenticating makes no difference.
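You can see the same refusal from code by catching urllib2.HTTPError, which carries the status code and headers of the rejected response. A minimal sketch against the Wikipedia URL from the question:
import urllib2
site = "http://en.wikipedia.org/wiki/StackOverflow"
try:
    page = urllib2.urlopen(site)
except urllib2.HTTPError as e:
    # e.code is the HTTP status (403 here), e.msg the reason phrase,
    # and e.hdrs the response headers the server sent back
    print e.code, e.msg
    print e.hdrs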
To fix it, add a User-Agent header to the current code:
import urllib2, sys
from bs4 import BeautifulSoup
site = "http://en.wikipedia.org/wiki/StackOverflow"
hdr = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a regular browser
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup
For Python 3, the equivalent is:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
site = "http://en.wikipedia.org/wiki/StackOverflow"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)
print(soup)
Alternatively, fetch the page through a real browser with Selenium (here using the PhantomJS headless driver):
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://en.wikipedia.org/wiki/StackOverflow")
assert "Stack Overflow - Wikipedia" in browser.title
The reason the modified version works is that Wikipedia checks whether the User-Agent belongs to a popular browser and returns 403 Forbidden to the default User-Agent sent by urllib2.
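The same behaviour is easy to demonstrate with the third-party requests library (an assumption here: requests is installed; it is not part of the standard library). The only difference between the two calls below is the User-Agent header:
import requests
site = "http://en.wikipedia.org/wiki/StackOverflow"
default = requests.get(site)  # sent with requests' default User-Agent
spoofed = requests.get(site, headers={'User-Agent': 'Mozilla/5.0'})
print(default.status_code)  # 403 if Wikipedia rejects the stock User-Agent
print(spoofed.status_code)  # 200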