I would like to obtain relatively up-to-date static HTML files from the enormous (even when compressed) English Wikipedia XML dump file enwiki-latest-pages-articles.xml.bz2 that I downloaded from the Wikimedia dumps page. There seem to be quite a few tools available, but the documentation on them is scant, so I don't know what most of them do or whether they work with the latest dumps. (I'm fairly good at building web crawlers for relatively small HTML pages/files, but I'm awful with SQL and XML, and I don't expect to be very good with either for at least another year.) My goal is to crawl the HTML files obtained from a dump offline, without resorting to crawling Wikipedia online.
Does anyone know of a good tool to obtain static HTML files from recent Wikipedia XML dumps?
First, import the dump into a local MediaWiki installation (the standard tools for this are MediaWiki's maintenance/importDump.php script or mwdumper). Then generate the static HTML files with the DumpHTML extension. Although simple in theory, this process can be complicated in practice because of the sheer volume of data involved and because DumpHTML is somewhat neglected, so don't hesitate to ask for help.
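Here is a minimal sketch of how those two steps might be scripted, assuming a working local MediaWiki install at MEDIAWIKI_DIR with the DumpHTML extension installed. The directory paths, the exact dumpHTML.php location, and its -d flag are assumptions based on how the extension has historically worked, so verify them against the version you actually install:

```python
import subprocess
from pathlib import Path

# Assumed locations -- adjust for your setup (hypothetical paths).
MEDIAWIKI_DIR = Path("/var/www/mediawiki")                 # local MediaWiki install
DUMP_FILE = Path("enwiki-latest-pages-articles.xml.bz2")   # the downloaded dump
HTML_OUT_DIR = Path("/data/wiki-html")                     # where static HTML should go


def run_php(script, *args):
    """Run a MediaWiki maintenance script and fail loudly on error."""
    cmd = ["php", str(MEDIAWIKI_DIR / script), *args]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Step 1: import the XML dump into the local MediaWiki database.
# (Decompress the .bz2 first if your MediaWiki version's importDump.php
# does not handle compressed input; for a dump this size, mwdumper piped
# straight into MySQL is usually much faster.)
run_php("maintenance/importDump.php", str(DUMP_FILE))

# Step 2: rebuild derived tables so imported pages render correctly.
run_php("maintenance/rebuildrecentchanges.php")

# Step 3: generate static HTML with the DumpHTML extension.
# The script path and -d (destination directory) option are assumed here;
# check the extension's own documentation for the release you use.
run_php("extensions/DumpHTML/dumpHTML.php", "-d", str(HTML_OUT_DIR))
```

For a full English Wikipedia dump, expect the import alone to take a long time and a lot of disk space, so it's worth validating the whole pipeline on a much smaller dump (or a partial one) first.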