 

Obtaining static HTML files from Wikipedia XML dump

I would like to be able to obtain relatively up-to-date static HTML files from the enormous (even when compressed) English Wikipedia XML dump file enwiki-latest-pages-articles.xml.bz2 that I downloaded from the Wikimedia dumps page. There seem to be quite a few tools available, but the documentation on them is pretty scant, so I don't know what most of them do or whether they're up to date with the latest dumps. (I'm rather good at building web crawlers that can crawl through relatively small HTML pages/files, but I'm awful with SQL and XML, and I don't expect to be very good with either for at least another year.) I want to be able to crawl through the HTML files obtained from a dump offline, without resorting to crawling Wikipedia online.

Does anyone know of a good tool to obtain static HTML files from recent Wikipedia XML dumps?

Brian Schmitz asked May 23 '12


1 Answer

First, import the dump into a local MediaWiki database. Then generate the HTML files with the DumpHTML extension. Although simple in theory, this process can be complicated in practice because of the sheer volume of data involved and because DumpHTML is a bit neglected, so don't hesitate to ask for help.
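In case it helps, here is a rough sketch of that workflow, assuming a working local MediaWiki install backed by MySQL, with the DumpHTML extension checked out under extensions/. The paths, the database/user names and the dumpHTML.php option shown are assumptions on my part; check each script's --help output before running anything against a full enwiki dump.

    # 1. Import the XML dump into the local MediaWiki database.
    #    importDump.php works but is slow on a full enwiki dump; piping
    #    mwdumper output into MySQL is the usual faster route.
    php maintenance/importDump.php /path/to/enwiki-latest-pages-articles.xml.bz2

    # Alternative (database name "wikidb" and user "wikiuser" are placeholders):
    # java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
    #     | mysql -u wikiuser -p wikidb

    # 2. Rebuild link tables and other derived data after the import.
    php maintenance/rebuildall.php

    # 3. Generate the static HTML tree with the DumpHTML extension;
    #    -d sets the destination directory for the HTML files.
    php extensions/DumpHTML/dumpHTML.php -d /var/www/static-wiki

Run the scripts from the root of the MediaWiki installation. Importing the full English Wikipedia this way can take days, so it's worth testing the whole pipeline on a much smaller dump (e.g. simplewiki) first.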

MaxSem answered Sep 19 '22