Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs. I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler but it makes very little sense to me. Do I need to install Tika? I'm lost, please help.


Indexing PDF with Solr

1 Answer

With Solr 4.9 (the latest version at the time of writing), extracting data from rich documents such as PDFs, spreadsheets (the xls/xlsx family), presentations (ppt, pptx), and text documents (doc, txt, etc.) has become fairly simple. The sample code in the downloaded Solr archive contains a basic Solr template project to get you started quickly.

The necessary configuration changes are as follows:

1. Change solrconfig.xml to include the following lines:

<lib dir="<path_to_extraction_libs>" regex=".*\.jar" />
<lib dir="<path_to_solr_cell_jar>" regex="solr-cell-\d.*\.jar" />

Then create a request handler as follows:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults" />
</requestHandler>

2. Add the necessary jars from the Solr example to your project.

3. Define the schema as per your needs and send a request like:

curl "http://localhost:8983/solr/collection1/update/extract?literal.id=1&literal.filename=testDocToExtractFrom.txt&literal.created_at=2014-07-22+09:50:12.234&commit=true" -F "myfile=@testDocToExtractFrom.txt"
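If you want to script the upload, the query string of the curl request can be assembled programmatically. The following is a minimal Python sketch, assuming the same host, core name (collection1) and literal fields as the curl example. The helper name build_extract_url is hypothetical, and only the standard library is used. It builds the URL only; the file itself still has to be sent as multipart form data, which is what curl's -F option does.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the curl example (default port, "collection1" core).
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def build_extract_url(doc_id, literals=None, commit=True, core_url=SOLR_CORE_URL):
    """Return the /update/extract URL for one document.

    Hypothetical helper: every entry in `literals` becomes a
    `literal.<field>` parameter, which Solr sets as a field value
    on the indexed document.
    """
    params = {"literal.id": doc_id}
    for field, value in (literals or {}).items():
        params["literal." + field] = value
    if commit:
        params["commit"] = "true"
    # safe=":" leaves colons unescaped; spaces are encoded as "+".
    return core_url + "/update/extract?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(build_extract_url(
        "1",
        {"filename": "testDocToExtractFrom.txt",
         "created_at": "2014-07-22 09:50:12.234"},
    ))
```

Building the parameters from a dictionary keeps the encoding of spaces and special characters in one place, which is easy to get wrong when concatenating strings by hand.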

Then go to the Solr admin UI and run a query to see the indexed contents.
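The same check can be scripted instead of done in the browser. This is a rough Python sketch, assuming the default /select handler, the JSON response writer (wt=json) and the collection1 core from the curl example. The function names are hypothetical, and fetch_docs needs a running Solr instance.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, as in the curl example.
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def build_select_url(query, core_url=SOLR_CORE_URL):
    """Return a /select URL asking Solr for a JSON response."""
    return core_url + "/select?" + urlencode({"q": query, "wt": "json"})


def extract_docs(response_text):
    """Pull the hit count and the document list out of a Solr JSON response."""
    body = json.loads(response_text)["response"]
    return body["numFound"], body["docs"]


def fetch_docs(query):
    """Run the query against a live Solr instance and return (numFound, docs)."""
    with urlopen(build_select_url(query)) as resp:
        return extract_docs(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # URL for looking up the document indexed with literal.id=1.
    print(build_select_url("id:1"))
```

Calling fetch_docs("id:1") after the upload should return a hit count of 1 if the document was indexed and committed.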

Let me know if you face any problems.
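One thing worth checking if the handler fails to load is whether the regex attributes in the lib directives actually match your jar file names. Below is a small Python sketch for testing that offline. It assumes Solr requires the pattern to match the whole file name, and the file names in the listing are examples only, so substitute the contents of your own directories.

```python
import re

# Patterns copied from the <lib> directives in solrconfig.xml.
ALL_JARS = r".*\.jar"
SOLR_CELL_JAR = r"solr-cell-\d.*\.jar"


def matching_files(pattern, filenames):
    """Return the file names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in filenames if compiled.fullmatch(name)]


if __name__ == "__main__":
    # Example directory listing; replace it with os.listdir() on your own path.
    listing = ["solr-cell-4.9.0.jar", "solr-core-4.9.0.jar", "README.txt"]
    print(matching_files(SOLR_CELL_JAR, listing))
    print(matching_files(ALL_JARS, listing))
```

An empty result for the solr-cell pattern would suggest that the dir attribute points at the wrong directory or that the jar is named differently in your distribution.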


answered Oct 18 '22 19:10

Raj Saxena


Tags:

full-text-search

solr

apache-tika

solrj

solr-cell

Mark
