How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?

Tags:

plugins

pdf

elasticsearch

attachment

elasticsearch-plugin

I'm new to Elasticsearch and I read here https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html that the mapper-attachments plugin is deprecated in Elasticsearch 5.0.0.

I am now trying to index a PDF file with the new ingest-attachment plugin and upload the attachment.

What I've tried so far is

curl -H 'Content-Type: application/pdf' -XPOST localhost:9200/test/1 -d @/cygdrive/c/test/test.pdf

but I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I would expect the PDF file to be indexed and uploaded. What am I doing wrong?

I also tested Elasticsearch 2.3.3, but the mapper-attachments plugin is not valid for this version, and I don't want to use any older version of Elasticsearch.

asked Jun 16 '16 13:06

7twenty7

1 Answer

You need to make sure you have created your ingest pipeline with:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
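
If you prefer to create the pipeline from a script rather than a console, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper name build_pipeline_request and the base URL http://localhost:9200 are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_pipeline_request(base_url, pipeline_id="attachment",
                           field="data", indexed_chars=-1):
    """Build (but do not send) the PUT request that defines the ingest pipeline.

    build_pipeline_request is a hypothetical helper; the body mirrors the
    pipeline definition shown above.
    """
    body = {
        "description": "Extract attachment information",
        "processors": [
            {"attachment": {"field": field, "indexed_chars": indexed_chars}}
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumed local node address; adjust to your cluster.
    req = build_pipeline_request("http://localhost:9200")
    print(req.get_method(), req.full_url)
    # To actually create the pipeline, pass req to urllib.request.urlopen().
```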

Then you can make a PUT (not a POST) request to your index using the pipeline you've created:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
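
To see what the data field in this request actually carries, the sample value can be decoded with Python's base64 module. This is only an illustrative sketch; the helper name decode_attachment is made up for the example.

```python
import base64

# The sample value from the request above.
SAMPLE = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="


def decode_attachment(data):
    """Return the raw bytes that a base64-encoded data field represents."""
    return base64.b64decode(data)


if __name__ == "__main__":
    # The sample decodes to a tiny RTF document containing a line of Lorem ipsum.
    print(decode_attachment(SAMPLE).decode("ascii"))
```

The decoded bytes are the original file content; the attachment processor does this decoding on its side before extracting the text.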

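In your example, it should be something like:

curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/test/1?pipeline=attachment' -d @/cygdrive/c/test/test.json

Remember that the PDF content must be base64 encoded and sent as the value of the data field in a JSON body, as in the request above. The raw PDF bytes are not valid JSON, which is what causes the not_x_content_exception. Here test.json is a hypothetical file holding that JSON document.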
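
One way to produce that base64-encoded body is sketched below in Python. It is a minimal sketch under assumptions: the helper names build_document_body and build_body_from_file are hypothetical, and the field name data is taken from the pipeline definition above.

```python
import base64
import json


def build_document_body(file_bytes, field="data"):
    """Return a JSON string whose `field` holds the base64-encoded bytes."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return json.dumps({field: encoded})


def build_body_from_file(path, field="data"):
    """Read a file in binary mode and wrap its content as above."""
    with open(path, "rb") as handle:
        return build_document_body(handle.read(), field)


if __name__ == "__main__":
    # Stand-in bytes so the sketch runs without a real PDF on disk.
    print(build_document_body(b"%PDF-1.4 example bytes"))
```

The resulting string can be saved to a file and passed to curl with -d @file, or sent directly as the request body of the PUT shown earlier.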

Hope it will help you.

Edit 1

Please make sure to read these; they helped me a lot:

Elastic Ingest

Ingest Plugin

Ingest Presentation

Edit 2

Also, you must have the ingest-attachment plugin installed.

./bin/elasticsearch-plugin install ingest-attachment

Edit 3

Before you create your ingest processor (attachment), create your index and map the fields you will use. Make sure the mapping accounts for the data field (the same name as the "field" option in your attachment processor), so that ingest can process it and extract your PDF content.

I set the indexed_chars option to -1 in the ingest processor. This removes the default limit of 100000 extracted characters, so you can index large PDF files.

Edit 4

The mapping should be something like this:

PUT my_index
{ 
    "mappings" : { 
        "my_type" : { 
            "properties" : { 
                "attachment.data" : { 
                    "type": "text", 
                    "analyzer" : "brazilian" 
                } 
            } 
        } 
    } 
}

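In this case I use the brazilian analyzer, but you can remove it or use your own. Note that, by default, the attachment processor writes the extracted text to the attachment.content field, so that is the field to map if you want the analyzer applied to the extracted text.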

answered Oct 05 '22 23:10

Evis
