I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."
Is it possible to include page numbers in the query result like this?
You can index not only the document text, but also bookmarks, comments, attachments, digital signatures, form fields, metadata, and other custom document properties. You can build an index file from all the PDF files in a set of folders you define.
Start the Server: If you didn't start Solr after installing it, you can start it by running bin/solr from the Solr directory. If you are running Windows, you can start Solr by running bin\solr.cmd instead. This will start Solr in the background, listening on port 8983.
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
It would require some development effort, but you could achieve this by indexing each page of each document as a separate Solr document, and then using field collapsing to group the page hits for each document.
Note that you need a nightly build for this; field collapsing is not implemented in any currently released Solr version.
Update: field collapsing is implemented as of Solr 3.3. More updates are expected in the next major version (Solr 4.0).
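The page-per-document approach described above can be sketched roughly as follows. This is only an illustration, not a tested integration: the field names `doc_id`, `page` and `text` are assumptions that would have to exist in your schema, and splitting the PDF into per-page text has to happen outside Solr before indexing (that step is not shown). The `group`, `group.field` and `group.limit` parameters are Solr's result grouping (field collapsing) parameters.

```python
from urllib.parse import urlencode


def page_documents(doc_id, pages):
    """Build one Solr document per page (field names are hypothetical)."""
    return [
        {"id": f"{doc_id}_p{n}", "doc_id": doc_id, "page": n, "text": text}
        for n, text in enumerate(pages, start=1)
    ]


def grouped_query(term, max_pages=100):
    """Build a query string that groups page hits by their parent document."""
    params = {
        "q": f"text:{term}",
        "group": "true",           # enable result grouping / field collapsing
        "group.field": "doc_id",   # one group per source PDF
        "group.limit": max_pages,  # page hits returned per group
        "fl": "doc_id,page",
        "wt": "json",
    }
    return urlencode(params)


def pages_by_document(response):
    """Map each document to the sorted page numbers that matched."""
    groups = response["grouped"]["doc_id"]["groups"]
    return {
        group["groupValue"]: sorted(d["page"] for d in group["doclist"]["docs"])
        for group in groups
    }
```

The list returned by `page_documents` would be sent to Solr's update handler as ordinary documents, and the string from `grouped_query` appended to the select URL of your core. `pages_by_document` then turns the grouped JSON response into something you can render as "found in bar.pdf on pages 2, 3 and 5".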