
Is there a parser/way available to parse Wikipedia dump files using Python?

I have a project where I need to collect all the Wikipedia articles belonging to a particular category, extract them from the Wikipedia dump, and load them into our database.

To do this I need to parse the Wikipedia dump file. Is there an efficient parser for this job? I am a Python developer, so I would prefer a parser written in Python. If there isn't one, please suggest a parser in another language and I will try to port it to Python and publish it, so that other people can use it or at least try it.

In short, all I want is a Python parser for Wikipedia dump files. In the meantime I have started writing a manual parser that walks each node of the dump.

None-da asked Mar 19 '09 09:03


3 Answers

There is example code for this at http://jjinux.blogspot.com/2009/01/python-parsing-wikipedia-dumps-using.html
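
For illustration only, here is a minimal sketch of the general idea (it is not the code from the linked post): stream through the dump with the standard library's `xml.etree.ElementTree.iterparse` and keep the pages whose wikitext contains a matching `[[Category:...]]` link. The helper names `iter_pages` and `pages_in_category`, the `SAMPLE` data, and the simple regex are assumptions made for this example. Categories that are added indirectly (for example by templates) would not be found this way.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches the category name in links such as [[Category:Foo]] or [[Category:Foo|sort key]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # drop the page's children to keep memory use low


def pages_in_category(source, category):
    """Yield (title, wikitext) for pages that link to the given category."""
    wanted = category.strip().lower()
    for title, text in iter_pages(source):
        found = {c.strip().lower() for c in CATEGORY_RE.findall(text)}
        if wanted in found:
            yield title, text


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Python</title><revision>
    <text>A language. [[Category:Programming languages]]</text>
  </revision></page>
  <page><title>Paris</title><revision>
    <text>A city. [[Category:Capitals in Europe|Paris]]</text>
  </revision></page>
</mediawiki>"""

matches = [t for t, _ in pages_in_category(io.BytesIO(SAMPLE), "Programming languages")]
print(matches)
```

For a real, compressed dump, the same functions should work if you pass them the file object returned by `bz2.open(path, "rb")`, so the file never has to be decompressed on disk or loaded into memory at once.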

Swaroop C H answered Sep 21 '22 16:09


Another good module is mwlib. It is a pain to install with all its dependencies (at least on Windows), but it works well.

PhilS answered Sep 22 '22 16:09


I don't know about the licensing, but this is implemented in Python and includes the source.

James L answered Sep 23 '22 16:09