
Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

Tags:

urllib2

Wikipedia's stance is:

Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.

That is why requests carrying Python's default User-Agent are blocked. For bulk access you're supposed to download the data dumps.

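That said, you can read pages like this in Python 2 (with url holding the page address):

import urllib2

# A non-default User-Agent keeps the request from being rejected outright.
req = urllib2.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib2.urlopen(req)
print con.read()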

Or in Python 3:

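import urllib.request  # a plain "import urllib" does not reliably load the request submodule

req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
con = urllib.request.urlopen(req)
print(con.read())  # the body is returned as bytes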
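
The Python 3 snippet can be wrapped in a small helper so the header is set in one place. This is only a minimal sketch: `build_request`, `fetch`, and the example User-Agent string are hypothetical names chosen for illustration, not part of urllib or anything Wikipedia prescribes.

```python
import urllib.request

# Hypothetical identifier for illustration; replace it with a string that
# describes your own script.
EXAMPLE_USER_AGENT = "ExampleFetcher/0.1 (contact: you@example.com)"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=EXAMPLE_USER_AGENT, timeout=10):
    """Fetch url and return the response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as con:
        return con.read()
```

Calling `fetch('http://en.wikipedia.org/wiki/OpenCola_(drink)')` should then return the page as bytes, which you can decode with `.decode('utf-8')` if you need text.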

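To debug this, you'll need to trap that exception and print the response body (Python 2):

import urllib2

try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError as e:
    # The error object is file-like; its body holds the server's explanation.
    print e.fp.read()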

When I print the resulting message, it includes the following:

"English

Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes."
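
In Python 3 the same exception lives in `urllib.error`. The following is a sketch of the equivalent debugging step; `describe_http_error` and `try_open` are hypothetical helper names used only for illustration.

```python
import urllib.error
import urllib.request


def describe_http_error(err):
    """Return the status code, reason and body of an HTTPError as text."""
    body = err.read().decode("utf-8", errors="replace")
    return "HTTP %d %s\n%s" % (err.code, err.reason, body)


def try_open(url):
    """Open url; return the body on success or a description of the HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as con:
            return con.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError is file-like, so the server's error page can be read from it.
        return describe_http_error(err)
```

Printing `try_open('http://en.wikipedia.org/wiki/OpenCola_(drink)')` would show either the page or the status line followed by whatever error page the server returned.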

Websites often filter access by checking whether a request comes from a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try sending a browser-like User-Agent instead. The following link takes you to an article that shows how.

http://wolfprojects.altervista.org/changeua.php

Some websites read the headers urllib sends and block access from scripts to avoid 'unnecessary' usage of their servers. I don't know and can't imagine why Wikipedia does/would do this, but have you tried spoofing your headers?
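
To see which User-Agent urllib sends by default, and to replace it for every request made through one opener, something like the following Python 3 sketch may help. The helper names are hypothetical and exist only for illustration.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that a fresh urllib opener sends by default."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs; urllib pre-populates it
    # with a single "User-agent" entry of the form "Python-urllib/<version>".
    return dict(opener.addheaders)["User-agent"]


def make_opener(user_agent):
    """Return an opener whose requests all carry the given User-Agent."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

With this, `make_opener('Magic Browser').open(url).read()` would fetch a page using the replaced header, while `default_user_agent()` shows the value a server sees when you do nothing.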
