 

Parsing a Wikipedia dump

For example, using this Wikipedia dump:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm

Is there an existing library for Python that I can use to create a mapping of subjects to values?

For example:

{height_ft, 6}, {nationality, American}
tomwu asked Aug 11 '10 at 22:08


People also ask

How do you read a Wikipedia dump?

We can access a dump of all of Wikipedia through Wikimedia at dumps.wikimedia.org (a dump is a periodic snapshot of a database). The English version is at dumps.wikimedia.org/enwiki; see the sketch after these questions for one way to stream through such a dump.

How much storage does Wikipedia use?

Remember that Wikipedia is huge: downloading the entire thing will take a while, and you'll need at least 150 gigabytes of free storage.
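
If you do go the full-dump route rather than the API, here is a rough sketch (not taken from either answer) of streaming pages out of a decompressed dump using only the standard library; the filename is just an example, and the {*} namespace wildcard in the lookups needs Python 3.8 or newer:

    # Rough sketch: stream <page> elements from a decompressed MediaWiki XML
    # dump (e.g. enwiki-latest-pages-articles.xml from dumps.wikimedia.org/enwiki/)
    # so the multi-gigabyte file never has to fit in memory.
    import xml.etree.ElementTree as ET

    def iter_pages(path):
        """Yield (title, wikitext) pairs from a MediaWiki XML dump."""
        context = ET.iterparse(path, events=("start", "end"))
        _, root = next(context)  # the <mediawiki> root element
        for event, elem in context:
            if event == "end" and elem.tag.rsplit("}", 1)[-1] == "page":
                title = elem.findtext(".//{*}title", default="")
                text = elem.findtext(".//{*}text", default="")
                yield title, text
                elem.clear()
                root.remove(elem)  # keep the tree from growing as we stream

    if __name__ == "__main__":
        for title, text in iter_pages("enwiki-latest-pages-articles.xml"):
            print(title, len(text))
            break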


2 Answers

It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
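
A minimal sketch of that pipeline, assuming Python 3 for the standard-library part (mwlib historically targeted Python 2, so check the version you install; parseString from mwlib.uparser is the commonly documented entry point, not something verified against your setup):

    # Rough sketch: fetch a page through the MediaWiki API, pull the raw
    # wikitext out of the XML response with the standard library, then hand
    # it to mwlib's parser.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    from mwlib.uparser import parseString  # assumes mwlib is installed

    params = {
        "action": "query",
        "prop": "revisions",
        "titles": "Lebron James",
        "rvprop": "content",
        "redirects": "true",
        "format": "xml",  # use xml rather than xmlfm when parsing by machine
    }
    url = "http://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)

    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)

    # The raw MediaWiki markup sits in the <rev> element of the response.
    wikitext = tree.find(".//rev").text

    # parseString returns a tree of nodes (sections, templates, text, ...)
    # that you can walk to pull out things like infobox parameters.
    article = parseString(title="Lebron James", raw=wikitext)
    print(article)

Walking the returned tree for the infobox template parameters is where you would recover pairs like height_ft -> 6.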

chaos95 answered Oct 04 '22 at 10:10


I just stumbled across a library on PyPI, wikidump, that claims to provide

Tools to manipulate and extract data from Wikipedia dumps

I haven't used it yet, so you're on your own trying it out...

PhilS answered Oct 04 '22 at 09:10