Indexing a Wikipedia dump into Elasticsearch fails with "XML document structures must start and end within the same entity"

Question

I want to index Wikipedia into Elasticsearch.

To index the latest Wikipedia dump (https://dumps.wikimedia.org/enwiki/20151102/enwiki-20151102-pages-articles-multistream.xml.bz2), I tried two setups: stream2es with Elasticsearch 2.0.0, and the Wikipedia River Plugin 2.6.0 with Elasticsearch 1.6.0.

However, both failed with the same error message:

XML document structures must start and end within the same entity.

Erik B · Accepted Answer

I'm not sure how to make the XML imports work, but there is another option: Wikimedia has recently started publishing dumps of its production Elasticsearch indices.

The indices are exported every week, and each wiki gets two exports:

The content index, which contains only article pages: http://dumps.wikimedia.org/other/cirrussearch/20151116/enwiki-20151116-cirrussearch-content.json.gz
The general index, which contains all pages, including talk pages, templates, etc.: http://dumps.wikimedia.org/other/cirrussearch/20151116/enwiki-20151116-cirrussearch-general.json.gz

These dumps are formatted for the Elasticsearch bulk import API. Because that format is plain JSON, they are also usable outside Elasticsearch.
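To illustrate what "usable outside Elasticsearch" can look like: a bulk-format file alternates an action line with a document line, so the documents can be pulled out by keeping every second line. The sketch below is only an illustration on a made-up two-document sample; the helper name bulk_docs and the sample contents are hypothetical, and the title field mentioned in the comment is an assumption about the dump's document schema.

```shell
# Print only the document lines of a bulk-format stream read from stdin.
# Bulk files alternate: odd lines are action metadata, even lines are documents.
bulk_docs() {
  awk 'NR % 2 == 0'
}

# Tiny in-memory sample in the same two-line layout (contents are made up).
sample='{"index":{"_type":"page","_id":"1"}}
{"title":"Example A"}
{"index":{"_type":"page","_id":"2"}}
{"title":"Example B"}'

printf '%s\n' "$sample" | bulk_docs

# With a real dump, something along these lines should work, assuming the
# documents carry a "title" field:
#   zcat enwiki-20151116-cirrussearch-content.json.gz | bulk_docs | jq -r .title | head
```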

Importing them is not documented yet, but I do roughly the following:

Fetch the current mapping (the URL must be quoted, otherwise the shell treats the & as a background operator): curl 'https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&format=json' > mapping.json
Feed that mapping into Elasticsearch: jq .content < mapping.json | curl -XPUT localhost:9200/enwiki_content --data @-
Load the content dump into that index, sending batches of 2000 two-line records over 3 parallel jobs: zcat enwiki-20151116-cirrussearch-content.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @- > /dev/null'
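The batching in the last step matters because every chunk sent to _bulk has to contain whole action/document pairs, which means an even number of lines. If GNU parallel is not available, one possible serial alternative is to cut the stream into chunk files with split. This is a sketch under that assumption, not the method used above; the function name split_bulk, the chunk. prefix and the demo documents are all hypothetical.

```shell
# Split a bulk stream from stdin into chunk files of at most $2 documents
# each (two lines per document), written to directory $1.
split_bulk() {
  out_dir=$1
  docs_per_chunk=$2
  mkdir -p "$out_dir"
  # Doubling the document count keeps each action/document pair together.
  split -l "$((docs_per_chunk * 2))" - "$out_dir/chunk."
}

# Demo on five made-up documents, two documents per chunk.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5; do
  printf '{"index":{"_id":"%s"}}\n{"title":"Doc %s"}\n' "$i" "$i"
done | split_bulk "$demo_dir" 2

# Show the line count of each chunk file.
wc -l "$demo_dir"/chunk.*

# Each chunk could then be posted on its own, for example:
#   curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @"$demo_dir/chunk.aa"
```

Chunk files also make it easier to retry a single failed batch, at the cost of extra disk space compared with streaming through a pipe.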

Tags: xml, elasticsearch, wikipedia

Yuze
