I want to index Wikipedia into Elasticsearch.
I tried stream2es with Elasticsearch 2.0.0, and the Wikipedia River Plugin 2.6.0 with Elasticsearch 1.6.0, to index the latest Wikipedia dump: https://dumps.wikimedia.org/enwiki/20151102/enwiki-20151102-pages-articles-multistream.xml.bz2
However, both failed with the same error message:
XML document structures must start and end within the same entity.
I'm not sure how to make the XML imports work, but there is another option: Wikimedia has recently made dumps of its production Elasticsearch indices available.
The indices are exported every week, and there are two exports for each wiki.
The dumps are formatted for the Elasticsearch bulk import API. Because that format is JSON, they are also usable outside Elasticsearch.
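To illustrate the "usable outside Elasticsearch" point, here is a minimal Python sketch that reads such a dump without any Elasticsearch instance. It assumes the file follows the usual bulk layout of one action line followed by one document line, which matches the two-line records used in the import command. The function names `iter_bulk_pairs` and `iter_dump` are made up for this example.

```python
import gzip
import json
from typing import Iterable, Iterator, Tuple


def iter_bulk_pairs(lines: Iterable[str]) -> Iterator[Tuple[dict, dict]]:
    """Yield (action, document) pairs from bulk-format lines.

    Assumes every document line is preceded by exactly one action line.
    """
    # Skip blank lines so a trailing newline does not break the pairing.
    it = (line for line in lines if line.strip())
    for action_line in it:
        source_line = next(it, None)
        if source_line is None:
            raise ValueError("dump ended after an action line with no document")
        yield json.loads(action_line), json.loads(source_line)


def iter_dump(path: str) -> Iterator[Tuple[dict, dict]]:
    """Stream pairs from a gzip-compressed dump without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_bulk_pairs(handle)
```

For example, `for action, doc in iter_dump("enwiki-20151116-cirrussearch-general.json.gz"): ...` would walk the file one page at a time; which fields each document carries depends on the dump, so inspect `doc.keys()` first.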
Importing them is not documented yet, but I do roughly the following:
curl 'https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&format=json' > mapping.json
jq .content < mapping.json | curl -XPUT localhost:9200/enwiki_content --data @-
zcat enwiki-20151116-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @- > /dev/null'
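If GNU parallel is not available, the same batching can be sketched in Python with only the standard library. This is an approximation, not an equivalent: it sends batches one after another instead of running three jobs at once, and it reads `-L 2 -N 2000` as "2000 two-line records per request", which is my interpretation of those options. The helper names and the `Content-Type` header are assumptions for this example (newer Elasticsearch versions require the header; older ones should ignore it).

```python
import gzip
import itertools
import urllib.request
from typing import Iterable, Iterator, List

# Assumed endpoint, taken from the curl command above.
BULK_URL = "http://localhost:9200/enwiki_content/_bulk"


def batch_lines(lines: Iterable[str], docs_per_batch: int = 2000) -> Iterator[List[str]]:
    """Group lines into batches of docs_per_batch two-line records."""
    it = iter(lines)
    while True:
        # Each document occupies two lines: an action line and a source line.
        batch = list(itertools.islice(it, 2 * docs_per_batch))
        if not batch:
            return
        yield batch


def post_bulk(batch: List[str], url: str = BULK_URL) -> int:
    """POST one batch to the bulk endpoint and return the HTTP status code."""
    body = "".join(batch).encode("utf-8")
    if not body.endswith(b"\n"):
        body += b"\n"  # the bulk API expects a trailing newline
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-ndjson")
    with urllib.request.urlopen(request) as response:
        return response.status


def import_dump(path: str) -> None:
    """Stream a gzip-compressed dump into Elasticsearch batch by batch."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for batch in batch_lines(handle):
            post_bulk(batch)
```

Calling `import_dump("enwiki-20151116-cirrussearch-general.json.gz")` would then replace the `zcat | parallel` pipeline. Note that the bulk API can return HTTP 200 even when individual items fail, so for anything beyond a quick import it is worth parsing the response body and checking its `errors` flag.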