 

Obtaining static HTML files from Wikipedia XML dump

I would like to be able to obtain relatively up-to-date static HTML files from the enormous (even when compressed) English Wikipedia XML dump file enwiki-latest-pages-articles.xml.bz2 that I downloaded from the Wikimedia dumps page. There seem to be quite a few tools available, but the documentation on them is pretty scant, so I don't know what most of them do or whether they're up to date with the latest dumps. (I'm rather good at building web crawlers that can crawl through relatively small HTML pages/files, but I'm awful with SQL and XML, and I don't expect to be very good with either for at least another year.) I want to be able to crawl through the HTML files obtained from a dump offline, without resorting to crawling Wikipedia online.

Does anyone know of a good tool to obtain static HTML files from recent Wikipedia XML dumps?

Brian Schmitz asked May 23 '12


1 Answer

First, import the dump into a local MediaWiki database. Then generate the HTML files with the DumpHTML extension. Although simple in theory, this process can be complicated in practice because of the sheer volume of data involved and because DumpHTML is a bit neglected, so don't hesitate to ask for help.
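In case it helps, here is a rough sketch of that workflow, assuming a working local MediaWiki install backed by MySQL, with the DumpHTML extension checked out under extensions/. The paths, the database/user names and the dumpHTML.php option shown are assumptions on my part; check each script's --help output before running anything against a full enwiki dump.

    # 1. Import the XML dump into the local MediaWiki database.
    #    importDump.php works but is slow on a full enwiki dump; piping
    #    mwdumper output into MySQL is the usual faster route.
    php maintenance/importDump.php /path/to/enwiki-latest-pages-articles.xml.bz2

    # Alternative (database name "wikidb" and user "wikiuser" are placeholders):
    # java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
    #     | mysql -u wikiuser -p wikidb

    # 2. Rebuild link tables and other derived data after the import.
    php maintenance/rebuildall.php

    # 3. Generate the static HTML tree with the DumpHTML extension;
    #    -d sets the destination directory for the HTML files.
    php extensions/DumpHTML/dumpHTML.php -d /var/www/static-wiki

Run the scripts from the root of the MediaWiki installation. Importing the full English Wikipedia this way can take days, so it's worth testing the whole pipeline on a much smaller dump (e.g. simplewiki) first.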

MaxSem answered Sep 19 '22