
Indexing PDF with Solr

Can anyone point me to a tutorial?

My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs.

I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler

But it makes very little sense to me. Do I need to install Tika?

I'm lost. Please help.

Mark asked Jul 14 '11 13:07




1 Answer

With Solr 4.9 (the latest version at the time of writing), extracting data from rich documents such as PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx), and text documents (doc, txt, etc.) has become fairly simple. The sample code included in the download archive contains a basic Solr template project to get you started quickly.

The necessary configuration changes are as follows:

  1. Change solrconfig.xml to include the following lines:

    <lib dir="<path_to_extraction_libs>" regex=".*\.jar" />
    <lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />

     Then create a request handler as follows:

    <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults" />
    </requestHandler>

  2. Add the necessary jars from the Solr example to your project.

  3. Define the schema as per your needs and send a request like:

    curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "myfile=@testDocToExtractFrom.txt"
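Since the question is about PDFs, the same request can be repeated for every PDF in a directory. The following is only a rough sketch under some assumptions that are not part of the answer: the function names `extract_url` and `index_pdfs`, the `SOLR_URL` variable, and the form field name `myfile` are illustrative, and file names are assumed to be URL-safe (no spaces or special characters), because they are used unencoded as document ids.

```shell
#!/usr/bin/env bash
# Sketch: post every PDF in a directory to the /update/extract handler.
# SOLR_URL and the core name "collection1" are assumptions; adjust to your setup.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/collection1}"

# Build the /update/extract URL for one document id.
# The id is inserted as-is, so it must already be URL-safe.
extract_url() {
  local id="$1"
  printf '%s/update/extract?literal.id=%s' "$SOLR_URL" "$id"
}

# Index every *.pdf in the given directory, using the file name without
# its extension as the document id, then commit once at the end.
index_pdfs() {
  local dir="$1" file id
  for file in "$dir"/*.pdf; do
    [ -e "$file" ] || continue   # the glob matched nothing
    id="$(basename "$file" .pdf)"
    curl --fail --silent --show-error \
      "$(extract_url "$id")" -F "myfile=@${file}" || return 1
  done
  curl --fail --silent --show-error "$SOLR_URL/update?commit=true"
}
```

With a running Solr instance, `index_pdfs ./pdfs` would post each file and commit once at the end, which avoids a commit per document.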

Then go to the Solr admin UI and run a query to see the indexed contents.
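The indexed document can also be checked from the command line rather than the UI. This is a minimal sketch, assuming the default `/select` handler is enabled on the same core; the `select_url` helper and the `SOLR_URL` variable are illustrative names, not part of Solr.

```shell
#!/usr/bin/env bash
# Sketch: build a query URL that looks up one document by its id.
# SOLR_URL and the core name "collection1" are assumptions.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/collection1}"

# Return the /select URL for a given id, asking for indented JSON output.
# The id is inserted as-is, so it must already be URL-safe.
select_url() {
  local id="$1"
  printf '%s/select?q=id:%s&wt=json&indent=true' "$SOLR_URL" "$id"
}
```

Running `curl "$(select_url 1)"` against a live instance should return the document posted with `literal.id=1`, provided the commit has gone through.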

Let me know if you face any problems.

Raj Saxena answered Oct 18 '22 19:10