
What's a fast way to parse a Wikipedia XML dump for article content and populate a MySQL database?

For some text-mining applications, I need to identify the frequency of every word per article in the English-language Wikipedia and populate a MySQL database with that data. This official page suggests using mwdumper or xml2sql on the dump, but they don't directly serve my purpose (unless someone can explain how they can).

Using WikiExtractor, MySQLdb for Python, and a local MySQL server, on the other hand, lets me do exactly what I want, but it is so slow that parsing the entire dump would take about a month. Profiling the modified WikiExtractor program shows that most of the runtime is spent in its nested regular expression searches and in my database inserts.

Ideally, processing the articles shouldn't take more than a couple of days. How can I do it efficiently?

asked Nov 03 '22 by rkabra

1 Answer

The Perl package MediaWiki::DumpFile is good for parsing. Loading a dump and reading each page takes only a few lines of code.
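A minimal sketch of that loop, assuming a locally downloaded dump file (the filename below is illustrative):

    use strict;
    use warnings;
    use MediaWiki::DumpFile;

    # The dump filename is illustrative -- point it at your local copy.
    my $mw    = MediaWiki::DumpFile->new;
    my $pages = $mw->pages('enwiki-latest-pages-articles.xml');

    # Iterate over every page in the dump.
    while (defined(my $page = $pages->next)) {
        my $title = $page->title;
        my $text  = $page->revision->text;   # wikitext of the page's revision
        # ... count words in $text and write them to MySQL here ...
    }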

For a simple word-frequency calculation you can use the sample code in the Perl FAQ, or the Text::Ngrams package for something smarter.
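For example, a perlfaq-style hash count applied to the $text of each page from the loop above (naive tokenization on non-word characters; real wikitext would need its markup stripped first):

    # Naive word count for one article's wikitext.
    my %freq;
    $freq{$_}++ for grep { length } split /\W+/, lc $text;

    # %freq now maps each word to its number of occurrences in this article.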

Adding the results to a database is up to you; you are developing the application, so you know its requirements best.
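Purely as an illustration, per-article inserts through DBI and DBD::mysql could look like the following; the word_freq table, its columns, and the connection details are hypothetical, not anything specified in this answer:

    use DBI;

    # Hypothetical connection settings and table layout.
    my $dbh = DBI->connect('DBI:mysql:database=wiki;host=localhost',
                           'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });

    my $sth = $dbh->prepare(
        'INSERT INTO word_freq (article, word, freq) VALUES (?, ?, ?)');

    # Insert the counts gathered for the current article.
    while (my ($word, $count) = each %freq) {
        $sth->execute($title, $word, $count);
    }
    $dbh->commit;   # one commit per article keeps transaction overhead low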

answered Nov 15 '22 by Amir E. Aharoni