I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term <code>foo</code> was found in <code>bar.pdf</code> on pages 2, 3 and 5." Is it possible to include page numbers in the query result like this?

Indexing PDF with page numbers with Solr

1 Answer

It would require some development effort, but you could achieve this by indexing each page of each document as a separate Solr document, and then using field collapsing to group the different page hits for each document.

Note that when this answer was first written you needed a nightly build for this, because field collapsing was not implemented in any released Solr version.

Update: field collapsing is implemented as of Solr 3.3. More updates are expected in the next major version (Solr 4.0).
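
The page-per-document approach can be sketched roughly as follows. This is only an illustration, not a tested integration: the field names `doc_id`, `page` and `text` are assumptions that would have to exist in your schema, the per-page text is assumed to have been extracted beforehand by whatever PDF tool you use, and the response shape handled in `pages_by_document` is the grouped JSON layout as I understand it from Solr's result grouping (`group=true`, `group.field`, `group.limit`).

```python
def build_page_docs(doc_id, page_texts):
    """Turn a list of per-page texts into one Solr document per page."""
    return [
        {
            "id": f"{doc_id}_p{number}",  # unique key per page (hypothetical scheme)
            "doc_id": doc_id,             # shared by all pages of one PDF
            "page": number,               # 1-based page number
            "text": text,                 # text of this page only
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def grouped_query_params(term, max_pages=100):
    """Query parameters that group page hits by their parent document."""
    return {
        "q": f"text:{term}",
        "group": "true",            # enable result grouping (field collapsing)
        "group.field": "doc_id",    # one group per source PDF
        "group.limit": max_pages,   # how many page hits to return per group
        "fl": "doc_id,page",
        "wt": "json",
    }


def pages_by_document(response):
    """Collapse a grouped JSON response into {doc_id: [page, ...]}."""
    groups = response["grouped"]["doc_id"]["groups"]
    return {
        group["groupValue"]: sorted(doc["page"] for doc in group["doclist"]["docs"])
        for group in groups
    }
```

You would post the output of `build_page_docs` to your update handler, send the parameters from `grouped_query_params` to your select handler, and feed the parsed JSON into `pages_by_document` to get the page list for a message like "found in bar.pdf on pages 2, 3 and 5".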

answered Oct 12 '22 07:10

Karl Johansson

Tags:

full-text-search

pdf

solr

apache-tika

solr-cell

Daniel Hepper
