
Elasticsearch Parse Exception error when attempting to index PDF

I'm just getting started with Elasticsearch. We need to index thousands of PDF files, and I'm having a hard time getting even ONE of them to index successfully.

I installed the Attachment Type plugin and got the response: Installed mapper-attachments.

I followed the Attachment Type in Action tutorial, but the process hangs and I don't know how to interpret the error message. I also tried the gist, which hangs in the same place.

$ curl -X POST "localhost:9200/test/attachment/" -d json.file 
{"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]]","status":400}

More details:

The json.file contains an embedded Base64 PDF file (as per instructions). The first line of the file appears correct (to me anyway): {"file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8...

I'm not sure if maybe the json.file is invalid or if maybe elasticsearch just isn't set up to parse PDFs properly?!?

Encoding - Here's how we're encoding the PDF into json.file (as per tutorial):

coded=`cat fn6742.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file

also tried:

coded=`openssl base64 -in fn6742.pdf`

log:

[2012-06-07 12:32:16,742][DEBUG][action.index             ] [Bailey, Paul] [test][0], node[AHLHFKBWSsuPnTIRVhNcuw], [P], s[STARTED]: Failed to execute [index {[test][attachment][DauMB-vtTIaYGyKD4P8Y_w], source[json.file]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:147)
    at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:50)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:437)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:290)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:210)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)

Hoping someone can help me see what I'm missing or did wrong?

Meltemi asked Jun 13 '12 14:06



1 Answer

The following error points to the source of the problem.

Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]

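The byte values [106, 115, 111, ...] are the UTF-8 codes for the literal string "json.file" (9 characters, matching length=9). In other words, curl sent the file name itself as the request body instead of the contents of the file, and Elasticsearch could not parse that as JSON.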
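
If you want to check this yourself, the byte list from the error message can be decoded back into text. The snippet below is only a minimal sketch for a POSIX shell; the variable names (codes, decoded, length) are made up for illustration and have nothing to do with Elasticsearch or curl.

```shell
#!/bin/sh
# Byte values copied from the error message.
codes="106 115 111 110 46 102 105 108 101"

decoded=""
for code in $codes; do
  # Convert the decimal value to a three-digit octal escape (106 -> \152),
  # then let printf turn that escape into the actual character.
  octal=$(printf '%03o' "$code")
  decoded="${decoded}$(printf "\\${octal}")"
done

# Number of characters in the decoded string.
length=${#decoded}

echo "decoded: $decoded"
echo "length:  $length"
```

The decoded text is the file name rather than anything that looks like JSON, which is a quick way to tell that the request body was the argument string itself.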

To index the contents of the file, put "@" in front of the file name so that curl reads the request body from the file:

curl -X POST "localhost:9200/test/attachment/" -d @json.file
imotov answered Sep 19 '22 00:09
