
Indexing a Wikipedia dump into Elasticsearch fails with the error "XML document structures must start and end within the same entity"

I want to index Wikipedia into Elasticsearch.

I tried two setups, stream2es + Elasticsearch 2.0.0 and Wikipedia River Plugin 2.6.0 + Elasticsearch 1.6.0, to index the latest Wikipedia dump: https://dumps.wikimedia.org/enwiki/20151102/enwiki-20151102-pages-articles-multistream.xml.bz2

However, both failed with the same error message:

XML document structures must start and end within the same entity.
Yuze asked Dec 14 '22 10:12

1 Answer

I'm not sure how to make the XML imports work, but there is another option: Wikimedia has recently made dumps of its production Elasticsearch indices available.

The indices are exported every week, and each wiki has two exports:

  • The content index, which contains only article pages: http://dumps.wikimedia.org/other/cirrussearch/20151116/enwiki-20151116-cirrussearch-content.json.gz
  • The general index, containing all pages. This includes talk pages, templates, etc: http://dumps.wikimedia.org/other/cirrussearch/20151116/enwiki-20151116-cirrussearch-general.json.gz

These are formatted for the Elasticsearch bulk import API. Because that format is plain line-by-line JSON, the dumps are also usable outside Elasticsearch.
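As a rough illustration of using the dumps outside Elasticsearch: the bulk format pairs each action line with a document line, so the documents can be pulled out with standard tools. This is a minimal sketch, assuming every record in the dump is exactly one action line followed by one source line. The function name `bulk_docs` is made up for this example.

```shell
# Print only the document lines (every second line) of a bulk-format
# stream read from stdin.
# Assumption: each record is an action line followed by one source line.
bulk_docs() {
  awk 'NR % 2 == 0'
}

# Example, using the content dump listed above:
#   zcat enwiki-20151116-cirrussearch-content.json.gz | bulk_docs | head -n 1
```

The output is one JSON document per line, which most JSON tools can read directly.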

Importing them is not documented yet, but I do roughly the following:

  1. Fetch the current mapping (the URL is quoted so the shell does not treat the & as a background operator): curl 'https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&format=json' > mapping.json
  2. Feed that mapping into elasticsearch: jq .content < mapping.json | curl -XPUT localhost:9200/enwiki_content --data @-
  3. Load the dump that matches the mapping (the content dump for the enwiki_content index): zcat enwiki-20151116-cirrussearch-content.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @- > /dev/null'
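As I read the parallel options in step 3, -L 2 keeps each action/document pair together and -N 2000 sends 2000 such pairs per request. If GNU parallel is not available, the same batching can be sketched with split and a plain loop. This is a slower, sequential alternative, not what I actually run, and the function names `chunk_bulk` and `post_chunks` are made up for this example.

```shell
# Split a bulk stream on stdin into files of at most $1 documents each.
# A document is two lines (action + source), so each chunk holds 2 * $1
# lines and a pair is never split across files.
# -a 4 gives four-letter suffixes so large dumps do not exhaust file names.
chunk_bulk() {
  docs_per_chunk="$1"
  out_prefix="$2"
  split -a 4 -l "$((docs_per_chunk * 2))" - "$out_prefix"
}

# POST each chunk file to the given _bulk URL, one request per file.
# Note: Elasticsearch 6 and later also require a Content-Type header
# (-H 'Content-Type: application/x-ndjson').
post_chunks() {
  bulk_url="$1"
  shift
  for chunk in "$@"; do
    curl -s "$bulk_url" --data-binary "@$chunk" > /dev/null
  done
}

# Example (the /tmp prefix is an arbitrary choice):
#   zcat enwiki-20151116-cirrussearch-content.json.gz | chunk_bulk 2000 /tmp/enwiki_chunk_
#   post_chunks http://localhost:9200/enwiki_content/_bulk /tmp/enwiki_chunk_*
```

Keep in mind that this writes the whole uncompressed dump to disk before loading it, so the parallel pipeline is the better fit when disk space is tight.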
Erik B answered May 09 '23 05:05
