Fetch a Wikipedia article with Python

Tags:

python

http-status-code-403

urllib2

user-agent

wikipedia

I am trying to fetch a Wikipedia article with Python's urllib:

    import urllib

    f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
    s = f.read()
    f.close()

However, instead of the HTML page I get the following response, titled "Error - Wikimedia Foundation":

    Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
    Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT

Wikipedia seems to block requests that do not come from a standard browser.

Does anybody know how to work around this?

asked Sep 23 '08 09:09

dkp

2 Answers

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples in the urllib2 documentation:

    import urllib2

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
    page = infile.read()
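The snippet above targets Python 2. In Python 3, urllib2 was merged into urllib.request, so a rough equivalent might look like the sketch below. The helper names `build_request` and `fetch`, the default user agent string, and the timeout value are illustrative assumptions, not part of the original answer.

```python
import urllib.request

# Hypothetical default; a descriptive agent with contact details is an assumption here.
DEFAULT_USER_AGENT = "example-fetcher/0.1 (contact: you@example.com)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request for `url` that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=DEFAULT_USER_AGENT, timeout=10.0):
    """Download `url` and return the response body decoded to text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch` with the article URL from the question should then return the page as a string, assuming the server accepts the chosen user agent.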

answered Oct 11 '22 09:10

Florian Bösch

This is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be much easier, especially since you get the article contents directly, which removes the need to parse the HTML.

I have used it myself for two projects, and it works very well.
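As a rough illustration of that approach, the sketch below reads an article's wikitext through mwclient. It assumes a reasonably recent mwclient release is installed; the helper name `fetch_wikitext`, the optional `site` parameter, and the user agent string are hypothetical and only serve this example.

```python
def fetch_wikitext(title, site=None):
    """Return the raw wikitext of the page called `title`.

    `site` is an optional, already constructed site object. When it is
    omitted, a connection to English Wikipedia is created via mwclient.
    """
    if site is None:
        # Imported lazily so the helper can be loaded without mwclient installed.
        import mwclient

        site = mwclient.Site(
            "en.wikipedia.org",
            # Hypothetical agent string; replace with one describing your tool.
            clients_useragent="example-fetcher/0.1 (contact: you@example.com)",
        )
    page = site.pages[title]
    return page.text()
```

With mwclient available, `fetch_wikitext("Albert Einstein")` should return the article source as wikitext rather than rendered HTML.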

answered Oct 11 '22 11:10

Hannes Ovrén
