For example, using this Wikipedia API query:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm
Is there an existing Python library I can use to build a mapping of the infobox fields to their values?
For example:
{height_ft,6},{nationality, American}
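To illustrate, here is a naive sketch of the kind of mapping I mean, assuming the infobox arrives as wikitext with one `| field = value` line per attribute (the sample text and the regex below are just placeholders, not a real parser):

```python
import re

# Illustrative snippet of infobox wikitext (the real article has many more fields).
wikitext = """
{{Infobox basketball biography
| name        = LeBron James
| height_ft   = 6
| height_in   = 9
| nationality = American
}}
"""

# Naive parse: one "| field = value" pair per line. A real parser would need
# to handle nested templates, links, and multi-line values.
infobox = dict(re.findall(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", wikitext, re.MULTILINE))
print(infobox)
# {'name': 'LeBron James', 'height_ft': '6', 'height_in': '9', 'nationality': 'American'}
```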
Rather than pulling pages through the API, you can access a dump of all of Wikipedia through Wikimedia at dumps.wikimedia.org. (A dump is a periodic snapshot of a database.) The English version is at dumps.wikimedia.org/enwiki.
Note: Wikipedia is huge. Downloading the entire thing will take a while, and you'll need at least 150 GB of free storage.
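If you do want the full dump, a minimal sketch of streaming it to disk looks like this; the filename follows the usual enwiki-latest-pages-articles.xml.bz2 pattern, but check the dumps page for the exact files currently offered:

```python
import shutil
import urllib.request

# Articles-only dump of English Wikipedia. The "latest" filename pattern is an
# assumption -- browse dumps.wikimedia.org/enwiki/ to confirm what's available.
URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

# Stream straight to disk; the compressed file alone is tens of gigabytes.
with urllib.request.urlopen(URL) as response, \
        open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
    shutil.copyfileobj(response, out)
```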
It looks like you really want to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
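A rough sketch of that pipeline, assuming the classic format=xml response where the revision text sits directly inside a <rev> element (newer API versions may nest it under rvslots), with the hand-off to mwlib left as a comment since I haven't verified its current API:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Build the same API query as in the question, but with plain format=xml.
params = urllib.parse.urlencode({
    "action": "query",
    "prop": "revisions",
    "titles": "lebron james",
    "rvprop": "content",
    "redirects": "true",
    "format": "xml",
})
# Wikimedia asks clients to send a descriptive User-Agent.
request = urllib.request.Request(
    "https://en.wikipedia.org/w/api.php?" + params,
    headers={"User-Agent": "infobox-example/0.1 (contact: you@example.com)"},
)

with urllib.request.urlopen(request) as response:
    tree = ET.parse(response)

# The revision's wikitext is the text of the <rev> element under <page>/<revisions>.
rev = tree.find(".//rev")
wikitext = rev.text

# From here you would hand the markup to mwlib, roughly:
#     from mwlib.uparser import parseString
#     article = parseString(title="lebron james", raw=wikitext)
# (parseString is how I understand mwlib's helper; check its docs before relying on it.)
print(wikitext[:200])
```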
I just stumbled over a library on PyPI, wikidump, that claims to provide
"Tools to manipulate and extract data from wikipedia dumps"
I haven't used it yet, so you're on your own trying it...