
Indexing PDF with page numbers with Solr

I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."

Is it possible to include page numbers in the query result like this?

Daniel Hepper asked Nov 04 '10 06:11




1 Answer

It would require some development effort, but you could achieve this by indexing each page of each PDF as a separate Solr document and then using field collapsing to group the page hits that belong to the same PDF.

Note that you need a nightly build for this; field collapsing is not implemented in any currently released Solr version.

Also note: field collapsing is implemented as of Solr 3.3. Further updates are expected in the next major version (Solr 4.0).
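
To make the page-per-document idea concrete, here is a rough Python sketch. It assumes you have already split the PDF into one text string per page (that step is not shown), and the field names `file_name`, `page` and `text` are made up for illustration; they would need matching entries in your schema, with `file_name` as a single-valued string field so Solr can group on it. The `group`, `group.field` and `group.limit` parameters are those of Solr's result grouping feature. Nothing here talks to a server: it only builds the documents and query parameters and reads a grouped JSON response.

```python
def build_page_docs(file_name, page_texts):
    """Build one Solr document per page (field names are hypothetical)."""
    return [
        {
            "id": f"{file_name}#page={number}",  # unique key per page
            "file_name": file_name,              # field to group on
            "page": number,                      # 1-based page number
            "text": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def grouping_params(term, max_pages=100):
    """Query parameters that collapse page hits by source file.

    The term is inserted as-is; escaping of query syntax is not handled.
    """
    return {
        "q": f"text:{term}",
        "group": "true",
        "group.field": "file_name",
        "group.limit": max_pages,  # page hits returned per file
        "fl": "file_name,page",
        "wt": "json",
    }


def pages_by_file(response):
    """Map each file to its sorted page numbers from a grouped response."""
    groups = response["grouped"]["file_name"]["groups"]
    return {
        group["groupValue"]: sorted(doc["page"] for doc in group["doclist"]["docs"])
        for group in groups
    }


def describe(term, file_name, pages):
    """Format a hit summary in the style asked for in the question."""
    numbers = [str(page) for page in pages]
    if len(numbers) == 1:
        where = f"page {numbers[0]}"
    else:
        where = "pages " + ", ".join(numbers[:-1]) + " and " + numbers[-1]
    return f"term {term} was found in {file_name} on {where}."
```

You would post the output of `build_page_docs` to your update handler, send `grouping_params("foo")` to the select handler, and pass the parsed JSON to `pages_by_file`. The trade-off is that you lose the single-request convenience of the ExtractingRequestHandler, since page splitting has to happen before indexing.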

Karl Johansson answered Oct 12 '22 07:10