Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fetch a Wikipedia article with Python

I try to fetch a Wikipedia article with Python's urllib:

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")            s = f.read() f.close() 

However instead of the html page I get the following response: Error - Wikimedia Foundation:

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to () Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT  

Wikipedia seems to block request which are not from a standard browser.

Anybody know how to work around this?

like image 282
dkp Avatar asked Sep 23 '08 09:09

dkp


People also ask

How do I extract information from Wikipedia?

Just extract Wikipedia data via Google Spreadsheets, download all the data from the sheet to your laptop, and open it in Excel or LibreOffice. Google AdWords Keyword Planner suggests keywords with the commercial or transactional intent, unless you dig deep and use highly specific keywords in the input.

Can we scrape data from Wikipedia?

This is a fun gimmick and Wikipedia is pretty lenient when it comes to web scraping. There are also harder to scrape websites such as Amazon or Google. If you want to scrape such a website, you should set up a system with headless Chrome browsers and proxy servers.


2 Answers

You need to use the urllib2 that superseedes urllib in the python std library in order to change the user agent.

Straight from the examples

import urllib2 opener = urllib2.build_opener() opener.addheaders = [('User-agent', 'Mozilla/5.0')] infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes') page = infile.read() 
like image 55
Florian Bösch Avatar answered Oct 11 '22 09:10

Florian Bösch


It is not a solution to the specific problem. But it might be intersting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier. Especially since you will directly get the article contents which removes the need for you to parse the html.

I have used it myself for two projects, and it works very well.

like image 29
Hannes Ovrén Avatar answered Oct 11 '22 11:10

Hannes Ovrén