
How to index a .PDF file in ElasticSearch

I am new to ElasticSearch. I have gone through a very basic tutorial on creating indexes, and I understand the concept of indexing. I want ElasticSearch to search inside a .PDF file. Based on my understanding of creating indexes, it seems I need to read the .PDF file and extract all the keywords for indexing, but I do not understand what steps I need to follow. How do I read a .PDF file to extract keywords?

KurioZ7 asked Jan 18 '16 14:01



1 Answer

It seems that the elasticsearch-mapper-attachments plugin was deprecated in 5.0.0 (released Oct. 26th, 2016). The documentation recommends using the Ingest Attachment Processor plugin as a replacement.

To install:

sudo bin/elasticsearch-plugin install ingest-attachment 

See "How to index a pdf file in Elasticsearch 5.0.0 with ingest-attachment plugin?" for information on how to use the Ingest Attachment plugin.
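As a rough illustration of what using the plugin involves, here is a minimal Python sketch that uses only the standard library. It assumes an unsecured node at http://localhost:9200; the pipeline name pdf_attachment, the index name docs, and the field name data are all made up for this example. The idea is to register an ingest pipeline with an attachment processor, send the PDF as a base64 string through that pipeline, and then search the extracted text, which the processor places under attachment.content by default.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: local node without security
PIPELINE_ID = "pdf_attachment"     # hypothetical pipeline name
INDEX = "docs"                     # hypothetical index name


def build_pipeline(field="data"):
    """Pipeline body: the attachment processor reads base64 from `field`."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document(pdf_bytes, field="data"):
    """Document body: the raw PDF bytes, base64-encoded into `field`."""
    return {field: base64.b64encode(pdf_bytes).decode("ascii")}


def build_search(text):
    """Full-text query against the text the processor extracted."""
    return {"query": {"match": {"attachment.content": text}}}


def send_json(method, path, body):
    """Send a JSON request to the node and return the decoded response."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def index_pdf(path, doc_id):
    """Register the pipeline, then index one PDF file through it."""
    send_json("PUT", f"/_ingest/pipeline/{PIPELINE_ID}", build_pipeline())
    with open(path, "rb") as handle:
        body = build_document(handle.read())
    # `_doc` is the 7.x+ URL form; 5.x and 6.x expect a type name there.
    return send_json(
        "PUT", f"/{INDEX}/_doc/{doc_id}?pipeline={PIPELINE_ID}", body
    )
```

With a node running and the plugin installed, index_pdf("report.pdf", "1") followed by send_json("POST", f"/{INDEX}/_search", build_search("invoice")) would index one file and query it. So there is no need to extract keywords by hand: the processor pulls the text out of the PDF, and ElasticSearch analyzes that text like any other field.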

Ben.12 answered Oct 05 '22 17:10