
How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?

I'm new to Elasticsearch, and I read at https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I'm now trying to upload and index a PDF file with the new ingest-attachment plugin.

What I've tried so far is:

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I expected the PDF file to be uploaded and indexed. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for this version, and I don't want to use any older version of Elasticsearch.

7twenty7 asked Jun 16 '16 13:06



1 Answer

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
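
If you are working from a script rather than a REST console, the same pipeline definition can be sent with any HTTP client. The following is only a sketch using the Python standard library; it assumes a node at http://localhost:9200, and the helper name build_pipeline_request is hypothetical. It builds the request without sending it, so the body can be inspected first.

```python
import json
import urllib.request


def build_pipeline_request(name="attachment", field="data", indexed_chars=-1,
                           host="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates the ingest pipeline."""
    # Same body as the pipeline definition above.
    body = {
        "description": "Extract attachment information",
        "processors": [
            {"attachment": {"field": field, "indexed_chars": indexed_chars}}
        ],
    }
    return urllib.request.Request(
        "{}/_ingest/pipeline/{}".format(host, name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

With a node running, passing the result to urllib.request.urlopen would send it.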

Then index your document with a PUT request (not a POST) that names the pipeline you've created:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

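In your example, it should be something like this (using my_type as a placeholder type name, and a JSON Content-Type because the request body is a JSON document, not the raw PDF):

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/my_type/1?pipeline=attachment' -d '{"data": "<base64-encoded content of test.pdf>"}'

Remember that the PDF content must be base64 encoded and sent in the data field. Sending the raw PDF file as the request body is what causes the not_x_content_exception in your error.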
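
The base64 step can be scripted. Below is a minimal sketch in Python (standard library only), assuming a node at http://localhost:9200 and a pipeline named attachment; the helper names build_index_request and send, and the type name my_type, are hypothetical. It wraps the encoded PDF bytes in the JSON body that the pipeline reads.

```python
import base64
import json
import urllib.request


def build_index_request(pdf_bytes, doc_id, index="test", doc_type="my_type",
                        pipeline="attachment", host="http://localhost:9200"):
    """Build (but do not send) a PUT request that indexes a PDF via the pipeline."""
    # The attachment processor reads base64 text from the "data" field.
    body = {"data": base64.b64encode(pdf_bytes).decode("ascii")}
    url = "{}/{}/{}/{}?pipeline={}".format(host, index, doc_type, doc_id, pipeline)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send the request and return the parsed JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running, something like send(build_index_request(open("/cygdrive/c/test/test.pdf", "rb").read(), doc_id=1)) would index the file from the question.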

Hope it will help you.

Edit 1

Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed:

./bin/elasticsearch-plugin install ingest-attachment

Edit 3

Before you create your ingest pipeline (with the attachment processor), create your index and its mapping with the fields you will use. Make sure the documents you index carry the data field (the same name as "field" in your attachment processor), so that the processor can read the base64-encoded PDF content from it and extract the text.

I set the indexed_chars option to -1 in the attachment processor so that large PDF files are indexed in full; by default, only the first 100000 characters are extracted.

Edit 4

The mapping should be something like this (by default, the attachment processor writes the extracted text to attachment.content):

PUT my_index
{
    "mappings" : {
        "my_type" : {
            "properties" : {
                "attachment.content" : {
                    "type": "text",
                    "analyzer" : "brazilian"
                }
            }
        }
    }
}

In this case, I use the brazilian analyzer, but you can remove that or use your own.
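
To check that the extracted text is searchable, a full-text query can be run against the processor's output. This is only a sketch: it assumes the processor's default target field, so that the text lands in attachment.content, a node at http://localhost:9200, and a hypothetical helper name build_search_request.

```python
import json
import urllib.request


def build_search_request(text, index="my_index", field="attachment.content",
                         host="http://localhost:9200"):
    """Build (but do not send) a match query against the extracted PDF text."""
    # A match query analyzes `text` with the analyzer mapped for the field.
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        "{}/{}/_search".format(host, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the result with urllib.request.urlopen should return the documents whose extracted text matches.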

Evis answered Oct 05 '22 23:10